K-State Honor Code:

On my honor, as a student, I have neither given nor received unauthorized aid on this academic work." A grade of XF can result from a breach of academic honesty

Your name: Mohammad Najjartabar Bisheh

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://i.pinimg.com/originals/db/4f/88/db4f88f155d22599f59765e14f4c5497.jpg")
Out[1]:

1. Business understanding:

Movies are an important part of human's life these days. We can name movies as the cheapest and most available fun that everyone can have and it is not depended to how wealthy we are. A great movie can entertain almost everyon on this planet. But what makes a movie great? A good story? Great characters? cinematography and actors' performance? The answer is yes, all of these can impact on the quality of a movie. The important question is how we can measure quality and perfomrance of a movie?

Quality and performance of a movie can be measured in two point of view. First of all and as a general idea, critics review's can be a good refference to judge about quality of a movie. Critics will consider quality of a movie by their own criteria (e.g. Directing, Writing, Cinematography, Editing, Acting, Production Design, Sound and etc.). However to talk about performance of a movie, "perceived performance of audience" plays an important rule. To measure "perceived performance of audience" maybe we can simply look at amount of money that movie made but what have impact on the sell of a movie?

Todays, there are some websites (like IMDB) which does movie rating usually base on peoples' weighted vote. In this project, (based on some given information like genre, budget, duration, and ...) I want to see what is impact of each (and maybe combination of some) factor on IMDB rating which can assume as "perceived performance of audience".

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# import the scatter_matrix functionality
from pandas.plotting import scatter_matrix
import plotly.graph_objects as go
import plotly.express as px
import pingouin as pg
import statsmodels.api as sm
from statsmodels.formula.api import ols

#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score

#lasso regression
from sklearn import linear_model

#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest

# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE


from sklearn.cluster import KMeans

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances


# Classifiers
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier

#for validating your classification model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score

# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# grid search
from sklearn.model_selection import GridSearchCV


from sklearn.ensemble import RandomForestClassifier

2. Data understanding: The dataset contains a large number of variables with different types (e.g., numerical, categorial). Provide a brief summary of data understanding. Specifically, you need to:

In [3]:
# Reading dataset
df = pd.read_csv('movie_metadata.csv\movie_metadata.csv')    
df.head()
Out[3]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

Just with a look at the head of data we can see there are NaN and zero values which means missing data. df.info proves this that we are missing many values and data needs to be cleaned.

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
In [5]:
# Describe data

df.describe()
Out[5]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000

From describe data we can get basic statistics information from data.

However to have a sufficient judgment better not to rely on these data only and I won't comment on it seperatly.

data quality issues

(William McKnight, in Information Management, 2014)

Sources of Poor Data Quality The following are seven sources of data quality issues.

1.Entry quality: Did the information enter the system correctly at the origin?

2.Process quality: Proper checks and quality control at each touchpoint along the path can help ensure that problems are rooted out, but these checks are often absent in legacy processes.

3.Identification quality: Data quality processes can largely eliminate this problem by matching records, identifying duplicates, and placing a confidence score4 on the similarity of records.

4.Integration quality: Is all the known information about an object integrated to the point of providing an accurate representation of the object?

5.Usage quality: Is the information used and interpreted correctly at the point of access?

6.Aging quality: Has enough time passed that the validity of the information can no longer be trusted?

7.Organizational quality: The biggest challenge to reconciliation is getting the various departments to agree that their A equals the other’s B equals the other’s C plus D.

Based on our current dataset, we need to take care of missing and duplicated values.

In [6]:
#Finding and deleting duplicated values
len(df[df.duplicated()])
Out[6]:
45
In [7]:
len(df)
Out[7]:
5043
In [8]:
# Finding and taking care of missing values
df.isnull().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5043 non-null bool
director_name                5043 non-null bool
num_critic_for_reviews       5043 non-null bool
duration                     5043 non-null bool
director_facebook_likes      5043 non-null bool
actor_3_facebook_likes       5043 non-null bool
actor_2_name                 5043 non-null bool
actor_1_facebook_likes       5043 non-null bool
gross                        5043 non-null bool
genres                       5043 non-null bool
actor_1_name                 5043 non-null bool
movie_title                  5043 non-null bool
num_voted_users              5043 non-null bool
cast_total_facebook_likes    5043 non-null bool
actor_3_name                 5043 non-null bool
facenumber_in_poster         5043 non-null bool
plot_keywords                5043 non-null bool
movie_imdb_link              5043 non-null bool
num_user_for_reviews         5043 non-null bool
language                     5043 non-null bool
country                      5043 non-null bool
content_rating               5043 non-null bool
budget                       5043 non-null bool
title_year                   5043 non-null bool
actor_2_facebook_likes       5043 non-null bool
imdb_score                   5043 non-null bool
aspect_ratio                 5043 non-null bool
movie_facebook_likes         5043 non-null bool
dtypes: bool(28)
memory usage: 138.0 KB

See if there is any zeros in the table

In [9]:
criteria1 = df['budget'] == 0;
criteria2 = df['gross'] == 0;
criteria3 = df['color'] == 0;
criteria4 = df['director_name'] == 0;
criteria5 = df['num_critic_for_reviews'] == 0;
criteria6 = df['duration'] == 0;
criteria7 = df['director_facebook_likes'] == 0;
criteria8 = df['actor_3_facebook_likes'] == 0;
criteria9 = df['actor_2_name'] == 0;
criteria10 = df['actor_1_facebook_likes'] == 0;
criteria11 = df['genres'] == 0;
criteria12 = df['actor_1_name'] == 0;
criteria13 = df['genres'] == 0;
criteria14 = df['movie_title'] == 0;
criteria15 = df['num_voted_users'] == 0;
criteria16 = df['cast_total_facebook_likes'] == 0;
criteria17 = df['actor_3_name'] == 0;
criteria18 = df['facenumber_in_poster'] == 0;
criteria19 = df['plot_keywords'] == 0;
criteria20 = df['movie_imdb_link'] == 0;
criteria21 = df['num_user_for_reviews'] == 0;
criteria22 = df['language'] == 0;
criteria23 = df['content_rating'] == 0;
criteria24 = df['title_year'] == 0;
criteria25 = df['actor_2_facebook_likes'] == 0;
criteria26 = df['imdb_score'] == 0;
criteria26 = df['movie_facebook_likes'] == 0;



criteria = criteria1 & criteria2 & criteria3 & criteria4 & criteria5 & criteria6 & criteria7 & criteria8 & criteria9 & criteria10 & criteria11 & criteria12 & criteria13 & criteria14 & criteria15 & criteria16 & criteria17 & criteria18 & criteria19 & criteria20 & criteria21 & criteria22 & criteria23 & criteria24 & criteria25 & criteria26
In [10]:
# There are some missing values and marked by “”. I replace them with "na"
df['content_rating'] = df['content_rating'].str.replace('“”', 'na')
df['content_rating'] = df['content_rating'].str.replace('0', 'na')
In [11]:
missing = df.isna()
missing2 = missing.sum()
missing2 = missing2.reset_index(level=0)
missing2 = missing2.rename(columns={0: 'factor', 1: 'total'})
missing2 = missing2.set_index('index')
missing2 = missing2.sort_values('factor', ascending=False).plot(kind='bar')
# if I want to do analysis which contain gross, budget, aspect_ratio or content rating I will use dff which is df without "NA".

More than 800 movies we don't have their gross income and about 500 movies we don't have their budget. Also cosiderable missing values in aspect ratio and content rating and all of these factors are important in IMDB score analysis so we need to do something.

In [12]:
missing['actor_1_name'].describe()
Out[12]:
count      5043
unique        2
top       False
freq       5036
Name: actor_1_name, dtype: object
In [13]:
df['gross'].describe()
Out[13]:
count    4.159000e+03
mean     4.846841e+07
std      6.845299e+07
min      1.620000e+02
25%      5.340988e+06
50%      2.551750e+07
75%      6.230944e+07
max      7.605058e+08
Name: gross, dtype: float64
In [14]:
df['director_name'].value_counts().head(10)
Out[14]:
Steven Spielberg     26
Woody Allen          22
Martin Scorsese      20
Clint Eastwood       20
Ridley Scott         17
Steven Soderbergh    16
Tim Burton           16
Spike Lee            16
Renny Harlin         15
Oliver Stone         14
Name: director_name, dtype: int64
In [15]:
df['actor_1_name'].value_counts().head(10)
Out[15]:
Robert De Niro       49
Johnny Depp          41
Nicolas Cage         33
J.K. Simmons         31
Matt Damon           30
Denzel Washington    30
Bruce Willis         30
Liam Neeson          29
Steve Buscemi        27
Harrison Ford        27
Name: actor_1_name, dtype: int64
In [16]:
df['budget'].value_counts().head()
Out[16]:
20000000.0    174
15000000.0    143
25000000.0    142
30000000.0    141
10000000.0    135
Name: budget, dtype: int64

These are some general information about some selected columns to get a rough idea about them!

Data preparation

I figured there are 45 duplicated values, no empty cell but bunch of "NA" values. I am going to delete all duplicted rows. For "NA" values I won't delete all rows that contain it but I will remove from the rows that I need because I don't want to lose useful info.

In [17]:
df = df.drop_duplicates(keep='first')
len(df)
Out[17]:
4998

So we didn't have empty cell but there are NA values. First we're gonna drop a row if all cells are NA.

Then since IMDB score is the most important factor (our response here) so I drop all the row that we don't have IMDB score for it.

In [18]:
df.dropna(how='all');
df.dropna(subset=['imdb_score'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4998 entries, 0 to 5042
Data columns (total 28 columns):
color                        4979 non-null object
director_name                4895 non-null object
num_critic_for_reviews       4949 non-null float64
duration                     4983 non-null float64
director_facebook_likes      4895 non-null float64
actor_3_facebook_likes       4975 non-null float64
actor_2_name                 4985 non-null object
actor_1_facebook_likes       4991 non-null float64
gross                        4124 non-null float64
genres                       4998 non-null object
actor_1_name                 4991 non-null object
movie_title                  4998 non-null object
num_voted_users              4998 non-null int64
cast_total_facebook_likes    4998 non-null int64
actor_3_name                 4975 non-null object
facenumber_in_poster         4985 non-null float64
plot_keywords                4846 non-null object
movie_imdb_link              4998 non-null object
num_user_for_reviews         4977 non-null float64
language                     4986 non-null object
country                      4993 non-null object
content_rating               4697 non-null object
budget                       4511 non-null float64
title_year                   4891 non-null float64
actor_2_facebook_likes       4985 non-null float64
imdb_score                   4998 non-null float64
aspect_ratio                 4671 non-null float64
movie_facebook_likes         4998 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

Our dataset in df is not constant but I don't want to lose it because we can have good analysis on it.

In [19]:
# To have a constant dataset I made dff:
dff = df.dropna()
dff.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 0 to 5042
Data columns (total 28 columns):
color                        3723 non-null object
director_name                3723 non-null object
num_critic_for_reviews       3723 non-null float64
duration                     3723 non-null float64
director_facebook_likes      3723 non-null float64
actor_3_facebook_likes       3723 non-null float64
actor_2_name                 3723 non-null object
actor_1_facebook_likes       3723 non-null float64
gross                        3723 non-null float64
genres                       3723 non-null object
actor_1_name                 3723 non-null object
movie_title                  3723 non-null object
num_voted_users              3723 non-null int64
cast_total_facebook_likes    3723 non-null int64
actor_3_name                 3723 non-null object
facenumber_in_poster         3723 non-null float64
plot_keywords                3723 non-null object
movie_imdb_link              3723 non-null object
num_user_for_reviews         3723 non-null float64
language                     3723 non-null object
country                      3723 non-null object
content_rating               3723 non-null object
budget                       3723 non-null float64
title_year                   3723 non-null float64
actor_2_facebook_likes       3723 non-null float64
imdb_score                   3723 non-null float64
aspect_ratio                 3723 non-null float64
movie_facebook_likes         3723 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 843.5+ KB
In [20]:
b = pd.DataFrame(dff.genres.str.split('|').tolist(), index=dff.imdb_score).stack()
b = b.reset_index()[[0, 'imdb_score']] # genres variable is currently labeled 0
b.columns = ['genres', 'imdb_score'] # renaming genres
b.head()
Out[20]:
genres imdb_score
0 Action 7.9
1 Adventure 7.9
2 Fantasy 7.9
3 Sci-Fi 7.9
4 Action 7.1

b is a dataframe which contains movies based on their unique genres

In [21]:
# just a different way
df2 = dff.join(dff.pop('genres').str.get_dummies('|'))
df2.head()
Out[21]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name ... Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder ... 0 0 0 0 0 1 0 0 0 0
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp ... 0 0 0 0 0 0 0 0 0 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz ... 0 0 0 0 0 0 0 1 0 0
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy ... 0 0 0 0 0 0 0 1 0 0
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara ... 0 0 0 0 0 1 0 0 0 0

5 rows × 49 columns

In [22]:
b['genres'].value_counts().plot(kind='bar')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x26a9ebeb6a0>
In [23]:
b.groupby('genres')['imdb_score'].mean().plot.bar()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x26a9ebeb0b8>

These plots shows most of the genres are drama, comedy, thriller or action but they don't usually have the highest IMDB value

In the example they digged in to content_rating and tried to replace some values with other values. I really don't see any necessity to do that but to show that I could do that I'll do and make a different dataframe for it!

In [24]:
dff['content_rating'].value_counts()
Out[24]:
R            1687
PG-13        1291
PG            563
G              87
Not Rated      34
Unrated        22
Approved       17
X              10
NC-17           6
Passed          3
M               2
GP              1
Name: content_rating, dtype: int64
In [25]:
#First method
dff_replace = dff.replace({'content_rating': 'GP'}, {'content_rating': 'R'});
dff_replace = dff_replace.replace({'content_rating': 'M'}, {'content_rating': 'R'});
dff_replace = dff_replace.replace({'content_rating': 'Passed'}, {'content_rating': 'R'});
dff_replace = dff_replace.replace({'content_rating': 'NC-17'}, {'content_rating': 'R'});
dff_replace['content_rating'].value_counts()
Out[25]:
R            1699
PG-13        1291
PG            563
G              87
Not Rated      34
Unrated        22
Approved       17
X              10
Name: content_rating, dtype: int64
In [26]:
#second method
# create a function
def f(x):
    if x == 'R': return 'R'
    elif x == 'PG-13': return 'PG-13'
    elif x == 'PG': return 'PG'
    elif x == 'G': return 'G'
    elif x == 'Not Rated': return 'Not Rated'
    elif x == 'Unrated': return 'Unrated'
    elif x == 'Approved': return 'Approved'
    elif x == 'X': return 'X'
    else: return 'R'
    
    
In [27]:
dff_replace['content_rating'] = dff['content_rating'].apply(f)
dff_replace['content_rating'].value_counts()
Out[27]:
R            1699
PG-13        1291
PG            563
G              87
Not Rated      34
Unrated        22
Approved       17
X              10
Name: content_rating, dtype: int64

So we have a clean data now and we can go ahead and do some analysis!

Business intelligence

first of all to compare success in terms of money we need to know about profit of each movie. Since we have gross (income) and budget then we can find profit which is equal to gross - profit

In [28]:
dff['profit'] = dff ['gross'] - dff['budget']
dff.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 0 to 5042
Data columns (total 28 columns):
color                        3723 non-null object
director_name                3723 non-null object
num_critic_for_reviews       3723 non-null float64
duration                     3723 non-null float64
director_facebook_likes      3723 non-null float64
actor_3_facebook_likes       3723 non-null float64
actor_2_name                 3723 non-null object
actor_1_facebook_likes       3723 non-null float64
gross                        3723 non-null float64
actor_1_name                 3723 non-null object
movie_title                  3723 non-null object
num_voted_users              3723 non-null int64
cast_total_facebook_likes    3723 non-null int64
actor_3_name                 3723 non-null object
facenumber_in_poster         3723 non-null float64
plot_keywords                3723 non-null object
movie_imdb_link              3723 non-null object
num_user_for_reviews         3723 non-null float64
language                     3723 non-null object
country                      3723 non-null object
content_rating               3723 non-null object
budget                       3723 non-null float64
title_year                   3723 non-null float64
actor_2_facebook_likes       3723 non-null float64
imdb_score                   3723 non-null float64
aspect_ratio                 3723 non-null float64
movie_facebook_likes         3723 non-null int64
profit                       3723 non-null float64
dtypes: float64(14), int64(3), object(11)
memory usage: 1003.5+ KB
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [29]:
dff['return_on_investment_perc'] = (dff ['profit'] / dff['budget']) * 100
dff.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 0 to 5042
Data columns (total 29 columns):
color                        3723 non-null object
director_name                3723 non-null object
num_critic_for_reviews       3723 non-null float64
duration                     3723 non-null float64
director_facebook_likes      3723 non-null float64
actor_3_facebook_likes       3723 non-null float64
actor_2_name                 3723 non-null object
actor_1_facebook_likes       3723 non-null float64
gross                        3723 non-null float64
actor_1_name                 3723 non-null object
movie_title                  3723 non-null object
num_voted_users              3723 non-null int64
cast_total_facebook_likes    3723 non-null int64
actor_3_name                 3723 non-null object
facenumber_in_poster         3723 non-null float64
plot_keywords                3723 non-null object
movie_imdb_link              3723 non-null object
num_user_for_reviews         3723 non-null float64
language                     3723 non-null object
country                      3723 non-null object
content_rating               3723 non-null object
budget                       3723 non-null float64
title_year                   3723 non-null float64
actor_2_facebook_likes       3723 non-null float64
imdb_score                   3723 non-null float64
aspect_ratio                 3723 non-null float64
movie_facebook_likes         3723 non-null int64
profit                       3723 non-null float64
return_on_investment_perc    3723 non-null float64
dtypes: float64(15), int64(3), object(11)
memory usage: 1.0+ MB
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [30]:
#dff = dff.set_index('movie_title');

dff['profit'].head()
Out[30]:
0    523505847.0
1      9404152.0
2    -44925825.0
3    198130642.0
5   -190641321.0
Name: profit, dtype: float64
In [31]:
best_profit = dff.pivot_table(index='movie_title', aggfunc='sum', fill_value=0).sort_values(by=['profit'], ascending=False).head(20)
best_profit.head()
Out[31]:
actor_1_facebook_likes actor_2_facebook_likes actor_3_facebook_likes aspect_ratio budget cast_total_facebook_likes director_facebook_likes duration facenumber_in_poster gross imdb_score movie_facebook_likes num_critic_for_reviews num_user_for_reviews num_voted_users profit return_on_investment_perc title_year
movie_title
Avatar 1000 936 855 1.78 237000000 4834 0 178 0 760505847 7.9 33000 723 3054 886204 523505847 220.888543 2009
Jurassic World 3000 2000 1000 2.00 150000000 8458 365 124 0 652177271 7.0 150000 644 1290 418214 502177271 334.784847 2015
Titanic 29000 14000 794 2.35 200000000 45223 0 194 0 658672302 7.7 26000 315 2528 793059 458672302 229.336151 1997
Star Wars: Episode IV - A New Hope 11000 1000 504 2.35 11000000 13485 0 125 1 460935665 8.7 33000 282 1470 911097 449935665 4090.324227 1977
E.T. the Extra-Terrestrial 861 725 548 1.85 10500000 2811 14000 120 0 434949459 7.9 34000 215 515 281842 424449459 4042.375800 1982
In [32]:
plt.figure(figsize=[16,16])

pieprofit = dff[['movie_title','profit','budget','return_on_investment_perc', 'gross','imdb_score']];
pieprofit = pieprofit.sort_values(by=['profit'], ascending = False).head(10);
explode = (0.2, 0, 0, 0, 0.1, 0, 0, 0.15, 0, 0)  # explode 1st slice


colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue','maroon',  'aqua', 'khaki', 'darkturquoise', 'hotpink', 'mediumpurple']

plt.pie(pieprofit['profit'], labels=pieprofit['movie_title'], explode=explode, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
Out[32]:
(-1.3104572272260597,
 1.151403801508614,
 -1.1099411282454879,
 1.2327149813997411)

So Avatar, Jurassic World, Titanic and Starwas: Episode IV had the best profit.

In [33]:
dff['color'].value_counts().head()
Out[33]:
Color               3600
 Black and White     123
Name: color, dtype: int64

almost all of movies are color and just 123 of them are black and white. we can remove color column here but personally I don't like miss any information so I'll keep it.

In [34]:
plt.figure(figsize=[16,16])

piereturn = dff[['movie_title','profit','budget','return_on_investment_perc', 'gross','imdb_score']];
piereturn = piereturn.sort_values(by=['return_on_investment_perc'], ascending = False).head(10);
explode = (0.2, 0, 0, 0, 0.1, 0, 0, 0.15, 0, 0)  # explode 1st slice


colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue','maroon',  'aqua', 'khaki', 'darkturquoise', 'hotpink', 'mediumpurple']

plt.pie(piereturn['return_on_investment_perc'], labels=piereturn['movie_title'], explode=explode, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
Out[34]:
(-1.241485292377227,
 1.1186168991034913,
 -1.2760300324179432,
 1.1941261338796085)
In [35]:
piereturn = dff[['movie_title','profit','budget','return_on_investment_perc', 'gross','imdb_score']];
piereturn = piereturn.sort_values(by=['return_on_investment_perc'], ascending = False).head(20)
#px.scatter(piereturn, x="budget", y="return_on_investment_perc", text ='movie_title' , trendline="lowess")
piereturn
Out[35]:
movie_title profit budget return_on_investment_perc gross imdb_score
4793 Paranormal Activity 107902283.0 15000.0 719348.553333 107917283.0 6.3
4799 Tarnation 591796.0 218.0 271466.055046 592014.0 7.2
4707 The Blair Witch Project 140470114.0 60000.0 234116.856667 140530114.0 6.4
4984 The Brothers McMullen 10221600.0 25000.0 40886.400000 10246600.0 6.6
4936 The Texas Chain Saw Massacre 30775468.0 83532.0 36842.728535 30859000.0 7.5
3278 The Texas Chain Saw Massacre 30775468.0 83532.0 36842.728535 30859000.0 7.5
5035 El Mariachi 2033920.0 7000.0 29056.000000 2040920.0 6.9
4956 The Gallows 22657819.0 100000.0 22657.819000 22757819.0 4.2
4977 Super Size Me 11464368.0 65000.0 17637.489231 11529368.0 7.3
4821 Halloween 46700000.0 300000.0 15566.666667 47000000.0 7.9
2492 Halloween 46700000.0 300000.0 15566.666667 47000000.0 7.9
4674 American Graffiti 114223000.0 777000.0 14700.514801 115000000.0 7.5
4530 Rocky 116275247.0 960000.0 12112.004896 117235247.0 8.1
5011 In the Company of Men 2831622.0 25000.0 11326.488000 2856622.0 7.3
4791 Napoleon Dynamite 44140956.0 400000.0 11035.239000 44540956.0 6.9
4955 Facing the Giants 10074663.0 100000.0 10074.663000 10174663.0 6.7
4449 Snow White and the Seven Dwarfs 182925485.0 2000000.0 9146.274250 184925485.0 7.7
4725 Benji 39052600.0 500000.0 7810.520000 39552600.0 6.1
5042 My Date with Drew 84122.0 1100.0 7647.454545 85222.0 6.6
5027 The Circle 663780.0 10000.0 6637.800000 673780.0 7.5

Is language an important factor for imdb score? What about country?

In [36]:
#let's get some general information from language and country
dff['language'].value_counts().head(16)
Out[36]:
English       3566
French          34
Spanish         23
Mandarin        14
Japanese        10
German          10
Italian          7
Cantonese        7
Portuguese       5
Hindi            5
Korean           5
Norwegian        4
Dutch            3
Danish           3
Persian          3
Thai             3
Name: language, dtype: int64

Three persian movies on the list. Intresting!

I'm Persian

In [37]:
dff['country'].value_counts().head(20)
Out[37]:
USA            2961
UK              313
France          101
Germany          79
Canada           59
Australia        39
Spain            21
Japan            15
Hong Kong        13
China            12
New Zealand      11
Italy            11
Denmark           8
South Korea       8
Ireland           7
Mexico            6
Brazil            5
India             5
Thailand          4
Iran              4
Name: country, dtype: int64

There are 4 movies from Iran and 3 movies in persian! something might be wrong or might be one movie in Kurdish, Azeri or other Iranian trib's language

Dig into Iranian and Persian movies

In [38]:
e = dff.loc[df['country'] == 'Iran']
e.groupby(['movie_title','imdb_score'])['profit'].sum().sort_values(ascending=False).head()
Out[38]:
movie_title          imdb_score
A Separation         8.4            6598492.0
Children of Heaven   8.5             745402.0
The Circle           7.5             663780.0
Caravans             6.5          -13000000.0
Name: profit, dtype: float64
In [39]:
e = dff.loc[df['language'] == 'Persian']
e.groupby(['movie_title','imdb_score'])['profit'].sum().sort_values(ascending=False).head()
Out[39]:
movie_title          imdb_score
A Separation         8.4           6598492.0
Children of Heaven   8.5            745402.0
The Circle           7.5            663780.0
Name: profit, dtype: float64

Ok now I found the problem. Caravans was shot in Afghanistan and iran and starred Anthony Quinn, Jennifer O'Neill, and Michael Sarrazin.

In [40]:
# I want to draw pareto chart

def pareto_plot(df, x=None, y=None, title=None, show_pct_y=False, pct_format='{0:.0%}'):
    xlabel = x
    ylabel = y
    tmp = df.sort_values(y, ascending=False)
    x = tmp[x].values
    y = tmp[y].values
    weights = y / y.sum()
    cumsum = weights.cumsum()
    
    fig, ax1 = plt.subplots()
    ax1.bar(x, y)
    ax1.set_xlabel(xlabel)
    ax1.set_ylabel(ylabel)

    ax2 = ax1.twinx()
    ax2.plot(x, cumsum, '-ro', alpha=0.5)
    ax2.set_ylabel('', color='r')
    ax2.tick_params('y', colors='r')
    
    vals = ax2.get_yticks()
    ax2.set_yticklabels(['{:,.2%}'.format(x) for x in vals])

    # hide y-labels on right side
    if not show_pct_y:
        ax2.set_yticks([])
    
    formatted_weights = [pct_format.format(x) for x in cumsum]
    for i, txt in enumerate(formatted_weights):
        ax2.annotate(txt, (x[i], cumsum[i]), fontweight='heavy')    
    
    if title:
        plt.title(title)
    
    plt.tight_layout()
    plt.show()
In [41]:
pareto = dff.sort_values(by=['budget'], ascending = False).head(8)
In [42]:
pareto_plot(pareto, x='movie_title', y='gross', title='profit pareto')
In [43]:
dff.groupby(['movie_title','imdb_score'])['profit'].sum().sort_values(ascending=False).head(10)
Out[43]:
movie_title                                 imdb_score
Avatar                                      7.9           523505847.0
Jurassic World                              7.0           502177271.0
Titanic                                     7.7           458672302.0
Star Wars: Episode IV - A New Hope          8.7           449935665.0
E.T. the Extra-Terrestrial                  7.9           424449459.0
The Avengers                                8.1           403279547.0
The Lion King                               8.5           377783777.0
The Jungle Book                             7.8           375290282.0
Star Wars: Episode I - The Phantom Menace   6.5           359544677.0
The Dark Knight                             9.0           348316061.0
Name: profit, dtype: float64
In [44]:
dff.groupby(['movie_title','profit'])['imdb_score'].sum().sort_values(ascending=False).head(10)
Out[44]:
movie_title           profit      
King Kong              11051260.0     21.6
Home                   42343675.0     20.1
Casino Royale          17007184.0     16.0
Glory                  8830000.0      15.8
Halloween              46700000.0     15.8
The Jungle Book        187645141.0    15.6
Lucky Number Slevin   -4505513.0      15.6
Skyfall                104360277.0    15.6
Juno                   135992840.0    15.0
Eddie the Eagle       -7214368.0      15.0
Name: imdb_score, dtype: float64

woww. "Eddie the Eagle" with negative profit had a great score! That doesn't make sense! Maybe their marketin was not good!

In [45]:
fig = px.scatter(dff, x="imdb_score", y="gross")

fig.add_trace(
    go.Scatter(
        x=[8, 8],
        y=[0, 800000000],
        mode="lines",
        line=go.scatter.Line(color="gray"),
        showlegend=False)
)
fig.add_trace(
    go.Scatter(
        x=[1, 10],
        y=[500000000, 500000000],
        mode="lines",
        line=go.scatter.Line(color="gray"),
        showlegend=False)
)
fig.show()
In [46]:
fig = px.scatter(pieprofit, x="imdb_score", y="gross", text ='movie_title')

fig.add_trace(
    go.Scatter(
        x=[8, 8],
        y=[50000000, 800000000],
        mode="lines",
        line=go.scatter.Line(color="gray"),
        showlegend=False)
)
fig.add_trace(
    go.Scatter(
        x=[6, 9],
        y=[500000000, 500000000],
        mode="lines",
        line=go.scatter.Line(color="gray"),
        showlegend=False)
)
fig.show()

So Avatar had the best profit but The Dark Night had the best IMDB score!

In [47]:
df1 = dff
# setting my own values for bins
df1['imdbscores_bins'] = pd.cut(df1['imdb_score'], bins=[0, 2, 4, 6, 8, 10], labels=['0-1.99', '2-3.99', '4-5.99', '6-7.99', '8-10'],
                   include_lowest=True)
# see the result
df1.head()
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[47]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name ... content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes profit return_on_investment_perc imdbscores_bins
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder ... PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000 523505847.0 220.888543 6-7.99
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp ... PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0 9404152.0 3.134717 6-7.99
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz ... PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000 -44925825.0 -18.337071 6-7.99
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy ... PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000 198130642.0 79.252257 8-10
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara ... PG-13 263700000.0 2012.0 632.0 6.6 2.35 24000 -190641321.0 -72.294775 6-7.99

5 rows × 30 columns

Here I made bins for IMDB scores and changed their name.

In [48]:
#labels=['trash', 'tolerable', 'good', 'accaptable', 'incredible', 'unbelievable']
df1 = df1.replace({'imdbscores_bins': "6-7.99"}, {'imdbscores_bins': 'incredible'})
df1 = df1.replace({'imdbscores_bins': "0-1.99"}, {'imdbscores_bins': 'trash'})
df1 = df1.replace({'imdbscores_bins': "2-3.99"}, {'imdbscores_bins': 'tolerable'})
df1 = df1.replace({'imdbscores_bins': "4-5.99"}, {'imdbscores_bins': 'accaptable'})
df1 = df1.replace({'imdbscores_bins': "8-10"}, {'imdbscores_bins': 'unbelievable'})
df1.head()
Out[48]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name ... content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes profit return_on_investment_perc imdbscores_bins
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder ... PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000 523505847.0 220.888543 incredible
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp ... PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0 9404152.0 3.134717 incredible
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz ... PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000 -44925825.0 -18.337071 incredible
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy ... PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000 198130642.0 79.252257 unbelievable
5 Color Andrew Stanton 462.0 132.0 475.0 530.0 Samantha Morton 640.0 73058679.0 Daryl Sabara ... PG-13 263700000.0 2012.0 632.0 6.6 2.35 24000 -190641321.0 -72.294775 incredible

5 rows × 30 columns

In [49]:
fig = px.scatter_3d(dff, x='return_on_investment_perc', y='country', z='imdb_score')
fig.show()
In [50]:
fig = px.scatter_3d(df1, x='return_on_investment_perc', y='country', z='imdbscores_bins')
fig.show()

a huge number of movies has imdb better than 8 which I named as incredible

In [51]:
fig = px.scatter_3d(pieprofit, x='profit', y='return_on_investment_perc', z='imdb_score', text = 'movie_title')
fig.show()
In [52]:
fig = px.scatter(dff, x="country", y="imdb_score", marginal_y="rug", marginal_x="histogram")
fig.show()
In [53]:
fig = px.scatter(dff, x="gross", y="imdb_score", marginal_y="violin",
           marginal_x="box", trendline="ols")
fig.show()
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning:

Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.

In [54]:
plt.figure(figsize=[24,16])
fig = px.scatter_matrix(dff, dimensions=["imdb_score", "actor_1_facebook_likes", "actor_2_facebook_likes"])
fig.show()
<Figure size 1728x1152 with 0 Axes>

I don't see any strong relations between actors facebook likes and imdb score

In [55]:
df1['imdbscores_bins'].value_counts().head()
Out[55]:
incredible      2435
accaptable      1041
unbelievable     156
tolerable         87
trash              4
Name: imdbscores_bins, dtype: int64
In [57]:
fig = px.parallel_coordinates(dff, color="imdb_score", labels={"actor_1_facebook_likes": "Actor 1 likes",
                  "actor_2_facebook_likes": "Actor 2 likes", "actor_3_facebook_likes": "Actor 3 likes",
                  "director_facebook_likes": "Director likes", },
                    color_continuous_scale=px.colors.diverging.Tealrose, color_continuous_midpoint=2)
fig.show()
In [58]:
earth = dff.groupby(['country', 'gross'])['title_year'].sum(
).sort_values()
earth.head()
Out[58]:
country  gross    
Germany  26435.0      1927.0
USA      2808000.0    1929.0
         2300000.0    1933.0
         3000000.0    1935.0
         163245.0     1936.0
Name: title_year, dtype: float64
In [59]:
earth2=dff.pivot_table(index=['country','title_year'], values='gross', 
                   aggfunc='sum', fill_value=0, margins=True).reset_index()
#earth2=earth2['country'].value_counts()

earth2.head()
Out[59]:
country title_year gross
0 Afghanistan 2003 1127331
1 Argentina 2000 1221261
2 Argentina 2004 304124
3 Argentina 2009 20167424
4 Aruba 1998 10076136
In [60]:
fig = px.choropleth(earth2, locations="country", color="gross", hover_name="country", animation_frame="title_year", range_color=[0,999999999])
fig.show()

I gave a shot but didn't work. Didn't have enough time otherwise I could make it

In [61]:
#dff.groupby('imdb_score').hist(figsize=(10,10));
df1.groupby('imdbscores_bins')['actor_1_facebook_likes'].sum().plot.bar();
In [62]:
df1.groupby('imdbscores_bins')['actor_2_facebook_likes'].sum().plot.bar();
In [63]:
df1.groupby('imdbscores_bins')['actor_3_facebook_likes'].sum().plot.bar();
In [64]:
sns.lmplot("actor_1_facebook_likes", "director_facebook_likes", df1, hue="imdbscores_bins", x_jitter=.15, height=8)
Out[64]:
<seaborn.axisgrid.FacetGrid at 0x26aa4a510b8>

Now check each variable with imdb score

In [65]:
#ff.groupby('imdbscores_bins')['duration'].sum().plot.bar();
sns.violinplot("duration", "imdbscores_bins", data=df1,
               palette=["lightblue", "lightpink"])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x26aa4c2ff98>
In [66]:
#dff.groupby('imdbscores_bins')['num_critic_for_reviews'].sum().plot.bar();
sns.violinplot("num_critic_for_reviews", "imdbscores_bins", data=df1,
               palette=["lightblue", "lightpink"])
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x26aa4ce94a8>

Correlation analysis

In [67]:
dffcorr = dff.corr()
dffcorr
Out[67]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes profit return_on_investment_perc
num_critic_for_reviews 1.000000 0.227619 0.175715 0.245440 0.165648 0.460797 0.591586 0.233781 -0.035603 0.562596 0.103921 0.420184 0.251119 0.349825 0.179809 0.703579 0.037177 0.032508
duration 0.227619 1.000000 0.180397 0.120776 0.082901 0.242610 0.338953 0.117736 0.027522 0.352083 0.067380 -0.131689 0.126651 0.367388 0.153353 0.212977 0.007119 -0.033647
director_facebook_likes 0.175715 0.180397 1.000000 0.120584 0.090467 0.139729 0.302766 0.120076 -0.047577 0.218876 0.018105 -0.045718 0.117858 0.193125 0.037171 0.162641 0.024457 -0.006550
actor_3_facebook_likes 0.245440 0.120776 0.120584 1.000000 0.252450 0.281238 0.257632 0.485596 0.104853 0.200485 0.038207 0.116479 0.550328 0.064187 0.047918 0.259922 0.047490 -0.012651
actor_1_facebook_likes 0.165648 0.082901 0.090467 0.252450 1.000000 0.142577 0.178009 0.946075 0.055769 0.121393 0.016027 0.095997 0.390411 0.092927 0.056816 0.128804 0.027363 -0.016097
gross 0.460797 0.242610 0.139729 0.281238 0.142577 1.000000 0.622714 0.227955 -0.034389 0.545656 0.098318 0.053163 0.243977 0.215510 0.065903 0.358630 0.205771 0.018064
num_voted_users 0.591586 0.338953 0.302766 0.257632 0.178009 0.622714 1.000000 0.243834 -0.035822 0.779191 0.065398 0.023488 0.239126 0.482583 0.087079 0.514855 0.124032 0.009829
cast_total_facebook_likes 0.233781 0.117736 0.120076 0.485596 0.946075 0.227955 0.243834 1.000000 0.078250 0.176231 0.027670 0.126650 0.640366 0.105397 0.069464 0.199995 0.041736 -0.019215
facenumber_in_poster -0.035603 0.027522 -0.047577 0.104853 0.055769 -0.034389 -0.035822 0.078250 1.000000 -0.082542 -0.022107 0.069622 0.070261 -0.067010 0.019358 0.012960 0.011339 -0.013605
num_user_for_reviews 0.562596 0.352083 0.218876 0.200485 0.121393 0.545656 0.779191 0.176231 -0.082542 1.000000 0.070271 0.020059 0.184704 0.325026 0.098260 0.368575 0.095933 0.068167
budget 0.103921 0.067380 0.018105 0.038207 0.016027 0.098318 0.065398 0.027670 -0.022107 0.070271 1.000000 0.046340 0.034691 0.029462 0.024978 0.051059 -0.953628 -0.008115
title_year 0.420184 -0.131689 -0.045718 0.116479 0.095997 0.053163 0.023488 0.126650 0.069622 0.020059 0.046340 1.000000 0.123050 -0.132274 0.215610 0.305520 -0.029489 -0.015425
actor_2_facebook_likes 0.251119 0.126651 0.117858 0.550328 0.390411 0.243977 0.239126 0.640366 0.070261 0.184704 0.034691 0.123050 1.000000 0.100290 0.065212 0.226923 0.039678 -0.014177
imdb_score 0.349825 0.367388 0.193125 0.064187 0.092927 0.215510 0.482583 0.105397 -0.067010 0.325026 0.029462 -0.132274 0.100290 1.000000 0.031060 0.284034 0.036209 0.010068
aspect_ratio 0.179809 0.153353 0.037171 0.047918 0.056816 0.065903 0.087079 0.069464 0.019358 0.098260 0.024978 0.215610 0.065212 0.031060 1.000000 0.110104 -0.004630 -0.042060
movie_facebook_likes 0.703579 0.212977 0.162641 0.259922 0.128804 0.358630 0.514855 0.199995 0.012960 0.368575 0.051059 0.305520 0.226923 0.284034 0.110104 1.000000 0.058259 -0.003242
profit 0.037177 0.007119 0.024457 0.047490 0.027363 0.205771 0.124032 0.041736 0.011339 0.095933 -0.953628 -0.029489 0.039678 0.036209 -0.004630 0.058259 1.000000 0.013443
return_on_investment_perc 0.032508 -0.033647 -0.006550 -0.012651 -0.016097 0.018064 0.009829 -0.019215 -0.013605 0.068167 -0.008115 -0.015425 -0.014177 0.010068 -0.042060 -0.003242 0.013443 1.000000
In [68]:
dffcorr ['imdb_score']
Out[68]:
num_critic_for_reviews       0.349825
duration                     0.367388
director_facebook_likes      0.193125
actor_3_facebook_likes       0.064187
actor_1_facebook_likes       0.092927
gross                        0.215510
num_voted_users              0.482583
cast_total_facebook_likes    0.105397
facenumber_in_poster        -0.067010
num_user_for_reviews         0.325026
budget                       0.029462
title_year                  -0.132274
actor_2_facebook_likes       0.100290
imdb_score                   1.000000
aspect_ratio                 0.031060
movie_facebook_likes         0.284034
profit                       0.036209
return_on_investment_perc    0.010068
Name: imdb_score, dtype: float64

So briefly we can see here that num_critic_for_reviews, duration, num_voted_users, num_user_for_reviews, and movie_facebook_likes are highly positive correlated to imdb score and title year is negativly and relativly correlated which means people probably won't like the most recent movies as much as they liked old movies but probably fact is each movies takes times to get more viewers and increase imdb score and older movies had enough time for that

In [69]:
plt.figure(figsize=(14,10))
sns.heatmap(dff.corr(), vmax=.8, square=True, annot=True, fmt=".1f")
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x26a9f0f86a0>
In [70]:
sns.jointplot("imdb_score", "duration", dff, kind="hex", color="#8855AA")
Out[70]:
<seaborn.axisgrid.JointGrid at 0x26aa5425a58>
In [71]:
sns.jointplot("imdb_score", "num_critic_for_reviews", dff, kind="hex", color="#8855AA")
Out[71]:
<seaborn.axisgrid.JointGrid at 0x26aa5087940>
In [72]:
sns.jointplot("imdb_score", "num_user_for_reviews", dff, kind="hex", color="#8855AA")
Out[72]:
<seaborn.axisgrid.JointGrid at 0x26aa54eec88>

ANOVA test

In [73]:
cw_lm=ols('imdb_score ~ num_critic_for_reviews + duration + director_facebook_likes + actor_3_facebook_likes + actor_1_facebook_likes + gross + num_voted_users + cast_total_facebook_likes + facenumber_in_poster + num_user_for_reviews + budget + actor_2_facebook_likes + actor_2_facebook_likes + aspect_ratio + movie_facebook_likes + profit + return_on_investment_perc', 
          data=dff).fit() #Specify C for Categorical
print(sm.stats.anova_lm(cw_lm, typ=2))
                                sum_sq      df            F         PR(>F)
num_critic_for_reviews       58.182211     1.0    78.579237   1.170908e-18
duration                    245.302714     1.0   331.298859   5.670390e-71
director_facebook_likes       1.336854     1.0     1.805517   1.791292e-01
actor_3_facebook_likes        8.703798     1.0    11.755102   6.133820e-04
actor_1_facebook_likes       23.520899     1.0    31.766656   1.868087e-08
gross                      1718.646718     1.0  2321.155306   0.000000e+00
num_voted_users             382.360667     1.0   516.405427  3.893240e-107
cast_total_facebook_likes    23.263381     1.0    31.418860   2.231108e-08
facenumber_in_poster         14.582572     1.0    19.694807   9.349008e-06
num_user_for_reviews         73.443748     1.0    99.191034   4.458689e-23
budget                     1700.292273     1.0  2296.366317   0.000000e+00
actor_2_facebook_likes       22.150485     1.0    29.915814   4.810391e-08
aspect_ratio                 11.278462     1.0    15.232369   9.674904e-05
movie_facebook_likes          5.749855     1.0     7.765590   5.352189e-03
profit                     1701.690998     1.0  2298.255394   0.000000e+00
return_on_investment_perc     1.607580     1.0     2.171151   1.407058e-01
Residual                   2744.763940  3707.0          NaN            NaN

ANOVA interpretation

Instead of doing several t-test I did one ANOVA test with imdb score as response.

The ANOVA table shows only director_facebook_likes and return_on_investment_perc (which I made later) are not significant and the rest values are significant to predict IMDB score

Regression analysis

To do regression analysis, I took main important factors based on correlation and ANOVA test to predict scores and built comprehensive model with all those factors and seperatley with each of them. Here I run several regression and then explain results at the end

In [75]:
# Anova test

cw_lm2=ols('imdb_score ~ num_critic_for_reviews + duration + num_voted_users + num_user_for_reviews + movie_facebook_likes', 
          data=dff).fit() #Specify C for Categorical
print(sm.stats.anova_lm(cw_lm2, typ=2))
                             sum_sq      df           F        PR(>F)
num_critic_for_reviews    44.453784     1.0   57.483265  4.279843e-14
duration                 224.778891     1.0  290.661974  8.220168e-63
num_voted_users          361.807023     1.0  467.853290  7.976836e-98
num_user_for_reviews      79.180650     1.0  102.388636  9.243716e-24
movie_facebook_likes       7.989624     1.0   10.331397  1.319048e-03
Residual                2874.483803  3717.0         NaN           NaN
In [76]:
runs_reg_model1 = ols("imdb_score ~ num_critic_for_reviews + duration + num_voted_users + num_user_for_reviews + movie_facebook_likes",dff)
runs_reg1 = runs_reg_model1.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:33   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [77]:
runs_reg1.mse_resid
Out[77]:
0.7733343564000075
In [78]:
runs_reg_model = ols("imdb_score ~ profit",dff)
runs_reg2 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:33   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [79]:
runs_reg2.mse_resid
Out[79]:
1.1090070967023982
In [80]:
runs_reg_model = ols("imdb_score ~ num_critic_for_reviews",dff)
runs_reg3 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:34   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [81]:
runs_reg3.mse_resid
Out[81]:
0.97456765259219
In [82]:
runs_reg_model = ols("imdb_score ~ duration",dff)
runs_reg4 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:34   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [83]:
runs_reg4.mse_resid
Out[83]:
0.9605795086146524
In [84]:
runs_reg_model = ols("imdb_score ~ num_voted_users",dff)
runs_reg5 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:34   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [85]:
runs_reg5.mse_resid
Out[85]:
0.8518515801216262
In [86]:
runs_reg_model = ols("imdb_score ~ num_user_for_reviews",dff)
runs_reg6 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:34   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [87]:
runs_reg6.mse_resid
Out[87]:
0.9931515713762974
In [88]:
runs_reg_model = ols("imdb_score ~ movie_facebook_likes",dff)
runs_reg7 = runs_reg_model.fit()
print(runs_reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     325.2
Date:                Tue, 10 Dec 2019   Prob (F-statistic):          1.26e-289
Time:                        23:53:35   Log-Likelihood:                -4801.2
No. Observations:                3723   AIC:                             9614.
Df Residuals:                    3717   BIC:                             9652.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  4.7960      0.076     62.931      0.000       4.647       4.945
num_critic_for_reviews     0.0014      0.000      7.582      0.000       0.001       0.002
duration                   0.0117      0.001     17.049      0.000       0.010       0.013
num_voted_users         3.584e-06   1.66e-07     21.630      0.000    3.26e-06    3.91e-06
num_user_for_reviews      -0.0006   5.93e-05    -10.119      0.000      -0.001      -0.000
movie_facebook_likes   -3.175e-06   9.88e-07     -3.214      0.001   -5.11e-06   -1.24e-06
==============================================================================
Omnibus:                      421.868   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              732.675
Skew:                          -0.769   Prob(JB):                    7.97e-160
Kurtosis:                       4.536   Cond. No.                     9.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [89]:
runs_reg7.mse_resid
Out[89]:
1.0208759991455945

all the above regression models has same r-squared valed (.304) but the full model has the lowest mean squared value which means full model is better. Also all p-values are small so all the selected fatcors are significant.

Regression model based on regularization

In [90]:
df5 = dff[['imdb_score', 'num_critic_for_reviews','duration','num_voted_users','num_user_for_reviews', 'movie_facebook_likes']];
In [91]:
y = df5['imdb_score'] 

X = df5.drop(['imdb_score'], axis =1)

model1 = linear_model.Lasso(alpha=1)           
model1.fit(X, y)
model1_y = model1.predict(X)

print('Coefficients: ', model1.coef_)
print("y-intercept ", model1.intercept_)
Coefficients:  [ 1.20306748e-03  9.36447631e-03  3.60658505e-06 -5.36915979e-04
 -2.35245025e-06]
y-intercept  5.053686732988178
In [92]:
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
Out[92]:
[('num_critic_for_reviews', '0.001'),
 ('duration', '0.009'),
 ('num_voted_users', '0.000'),
 ('num_user_for_reviews', '-0.001'),
 ('movie_facebook_likes', '-0.000')]
In [93]:
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
mean square error:  0.774667126355074
variance or r-squared:  0.30201774314687235

This model gave same MSE but lower r-squered than privous one.

b) Regression model based on Feature selection

In [94]:
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
X_new
Out[94]:
array([[1.78000e+02, 8.86204e+05],
       [1.69000e+02, 4.71220e+05],
       [1.48000e+02, 2.75868e+05],
       ...,
       [7.70000e+01, 7.26390e+04],
       [8.10000e+01, 5.20550e+04],
       [9.00000e+01, 4.28500e+03]])
In [95]:
# this helps us find out which variables are selected

selector = SelectKBest(f_regression, k=2).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
[1 2]
In [96]:
model2 = lm.LinearRegression()
model2.fit(X_new, y)
model2_y = model2.predict(X_new)

print("mean square error: ", mean_squared_error(y, model2_y))
print("variance or r-squared: ", explained_variance_score(y, model2_y))
mean square error:  0.7993049500041615
variance or r-squared:  0.2798188358104363
In [97]:
selector = SelectKBest(f_regression, k=3).fit(X, y)
idxs_selected = selector.get_support(indices=True)

model3 = lm.LinearRegression()
model3.fit(X_new, y)
model3_y = model3.predict(X_new)

print("mean square error: ", mean_squared_error(y, model3_y))
print("variance or r-squared: ", explained_variance_score(y, model3_y))
mean square error:  0.7993049500041615
variance or r-squared:  0.2798188358104363

Here neither the MSE nor r-squared are better than first method!

Hierarchical Dendrogram

In [98]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward

from sklearn.metrics import pairwise_distances
In [99]:
np.random.seed(1) # setting random seed to get the same results each time.

agg= AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)
agg.labels_
Out[99]:
array([2, 0, 0, ..., 1, 1, 1], dtype=int64)
In [100]:
plt.figure(figsize=(16,8))

linkage_matrix = ward(X)
dendrogram(linkage_matrix, orientation="top")
plt.tight_layout() # fixes margins
In [101]:
plt.figure(figsize=(16,8))

plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')

linkage_matrix = ward(X)
dendrogram(linkage_matrix, 
           truncate_mode='lastp',  # show only the last p merged clusters
           p=4,  # show only the last p merged clusters
           #show_leaf_counts=False,  # otherwise numbers in brackets are counts
           leaf_rotation=90.,
           leaf_font_size=12.,
           show_contracted=True,  # to get a distribution impression in truncated branches
           orientation="top")
plt.tight_layout() # fixes margins

PCA

In [102]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)  
print(pca.explained_variance_ratio_)  

print(pca.singular_values_)  
[0.98567396 0.01432292]
[9271556.45698948 1117639.63867542]
In [103]:
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)                 


print(pca.explained_variance_ratio_)  

print(pca.singular_values_)  
[0.98567396 0.01432292]
[9271556.45698948 1117639.63867542]
In [104]:
pca = PCA(n_components=1, svd_solver='arpack')
pca.fit(X)  


print(pca.explained_variance_ratio_)  

print(pca.singular_values_)  
[0.98567396]
[9271556.45698948]
In [105]:
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2, batch_size=3)
ipca.fit(X)

ipca.transform(X) 
Out[105]:
array([[ 780278.84119654,  -33869.3526188 ],
       [ 363988.33745023,  -36210.38911611],
       [ 175428.51217363,   62949.45505415],
       ...,
       [ -32111.17196162,   12098.59667675],
       [ -54039.34710613,   -5333.44877396],
       [-101645.96566859,   -1359.84321136]])

Classification

To train classification model again I used the factors that was significant based on earlier analysis. Same as last section, I will run several machine learning model first and then talk about them at the end.

In [106]:
df6 = dff.dropna()
df6 = df6[['imdbscores_bins', 'num_critic_for_reviews','duration','num_voted_users','num_user_for_reviews', 'movie_facebook_likes']];
y = df6['imdbscores_bins'] 

X = df6.drop(['imdbscores_bins'], axis =1)
In [107]:
df6 = df.dropna()
df6 = df6[['imdb_score', 'num_critic_for_reviews','duration','num_voted_users','num_user_for_reviews', 'movie_facebook_likes']];

# setting my own values for bins
df6['imdbscores_bins'] = pd.cut(df6['imdb_score'], bins=[0, 4, 6, 8, 10], labels=[1, 2, 3, 4],
                   include_lowest=True)
# see the result
df6.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 0 to 5042
Data columns (total 7 columns):
imdb_score                3723 non-null float64
num_critic_for_reviews    3723 non-null float64
duration                  3723 non-null float64
num_voted_users           3723 non-null int64
num_user_for_reviews      3723 non-null float64
movie_facebook_likes      3723 non-null int64
imdbscores_bins           3723 non-null category
dtypes: category(1), float64(4), int64(2)
memory usage: 207.4 KB
In [108]:
df6 = df6.drop(['imdb_score'], axis =1)
In [109]:
df6['imdbscores_bins'] = df6.imdbscores_bins.astype(int)

df6.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 0 to 5042
Data columns (total 6 columns):
num_critic_for_reviews    3723 non-null float64
duration                  3723 non-null float64
num_voted_users           3723 non-null int64
num_user_for_reviews      3723 non-null float64
movie_facebook_likes      3723 non-null int64
imdbscores_bins           3723 non-null int32
dtypes: float64(3), int32(1), int64(2)
memory usage: 189.1 KB
In [110]:
y = df6['imdbscores_bins'] 

X = df6.drop(['imdbscores_bins'], axis =1)

a) Decision Tree

In [111]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train);
In [112]:
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
#print(metrics.roc_auc_score(y_test, dt.predict(X_test)));
0.6141450313339302
--------------------------------------------------------
[[  5  17   8   0]
 [ 14 149 137   0]
 [ 20 191 505  18]
 [  0   3  23  27]]
--------------------------------------------------------
              precision    recall  f1-score   support

           1       0.13      0.17      0.14        30
           2       0.41      0.50      0.45       300
           3       0.75      0.69      0.72       734
           4       0.60      0.51      0.55        53

   micro avg       0.61      0.61      0.61      1117
   macro avg       0.47      0.47      0.47      1117
weighted avg       0.64      0.61      0.62      1117

--------------------------------------------------------
In [113]:
import scikitplot as skplt

skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()
In [114]:
from sklearn.externals.six import StringIO
import pydotplus

dot_data = StringIO() 
tree.export_graphviz(dt, out_file=dot_data, feature_names=X.columns,
                     filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
In [115]:
from graphviz import Source
from sklearn import tree
Source( tree.export_graphviz(dt, out_file=None, feature_names=X.columns))
Out[115]:
Tree 0 num_voted_users <= 108545.5 gini = 0.491 samples = 2606 value = [61, 741, 1701, 103] 1 duration <= 110.5 gini = 0.51 samples = 1858 value = [60, 691, 1100, 7] 0->1 True 1046 num_voted_users <= 532035.0 gini = 0.333 samples = 748 value = [1, 50, 601, 96] 0->1046 False 2 num_voted_users <= 47816.5 gini = 0.543 samples = 1243 value = [56, 565, 619, 3] 1->2 777 movie_facebook_likes <= 58500.0 gini = 0.346 samples = 615 value = [4, 126, 481, 4] 1->777 3 movie_facebook_likes <= 186.5 gini = 0.554 samples = 877 value = [51, 435, 389, 2] 2->3 570 num_user_for_reviews <= 204.0 gini = 0.479 samples = 366 value = [5, 130, 230, 1] 2->570 4 num_user_for_reviews <= 255.0 gini = 0.549 samples = 444 value = [26, 184, 233, 1] 3->4 319 num_user_for_reviews <= 156.5 gini = 0.531 samples = 433 value = [25, 251, 156, 1] 3->319 5 duration <= 102.5 gini = 0.534 samples = 387 value = [20, 148, 218, 1] 4->5 280 num_critic_for_reviews <= 95.0 gini = 0.521 samples = 57 value = [6, 36, 15, 0] 4->280 6 num_critic_for_reviews <= 182.0 gini = 0.558 samples = 276 value = [18, 116, 141, 1] 5->6 207 movie_facebook_likes <= 143.5 gini = 0.435 samples = 111 value = [2, 32, 77, 0] 5->207 7 num_critic_for_reviews <= 164.0 gini = 0.563 samples = 253 value = [17, 112, 123, 1] 6->7 194 num_voted_users <= 27278.5 gini = 0.355 samples = 23 value = [1, 4, 18, 0] 6->194 8 num_critic_for_reviews <= 157.0 gini = 0.564 samples = 242 value = [17, 103, 121, 1] 7->8 187 duration <= 75.0 gini = 0.298 samples = 11 value = [0, 9, 2, 0] 7->187 9 movie_facebook_likes <= 168.5 gini = 0.565 samples = 232 value = [16, 102, 113, 1] 8->9 182 num_voted_users <= 13522.0 gini = 0.34 samples = 10 value = [1, 1, 8, 0] 8->182 10 duration <= 99.5 gini = 0.566 samples = 228 value = [16, 102, 109, 1] 9->10 181 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 9->181 11 num_user_for_reviews <= 209.5 gini = 0.569 samples = 191 value = [15, 90, 86, 0] 10->11 156 num_user_for_reviews <= 168.5 gini = 0.507 samples = 37 value = [1, 12, 23, 1] 10->156 12 num_user_for_reviews <= 16.5 gini = 0.571 samples = 181 value = [15, 88, 78, 0] 11->12 151 num_voted_users <= 45965.5 gini = 0.32 samples = 10 value = [0, 2, 8, 0] 11->151 13 num_critic_for_reviews <= 22.5 gini = 0.406 samples = 8 value = [1, 1, 6, 0] 12->13 18 duration <= 98.5 gini = 0.567 samples = 173 value = [14, 87, 72, 0] 12->18 14 duration <= 86.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 13->14 17 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 13->17 15 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 14->15 16 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 14->16 19 num_voted_users <= 42270.0 gini = 0.57 samples = 161 value = [13, 78, 70, 0] 18->19 142 num_user_for_reviews <= 61.5 gini = 0.403 samples = 12 value = [1, 9, 2, 0] 18->142 20 num_critic_for_reviews <= 44.5 gini = 0.57 samples = 157 value = [13, 78, 66, 0] 19->20 141 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 19->141 21 num_critic_for_reviews <= 25.5 gini = 0.589 samples = 41 value = [5, 15, 21, 0] 20->21 52 num_user_for_reviews <= 65.5 gini = 0.55 samples = 116 value = [8, 63, 45, 0] 20->52 22 duration <= 87.5 gini = 0.521 samples = 13 value = [1, 8, 4, 0] 21->22 33 num_user_for_reviews <= 22.5 gini = 0.548 samples = 28 value = [4, 7, 17, 0] 21->33 23 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 22->23 24 num_user_for_reviews <= 58.5 gini = 0.43 samples = 11 value = [1, 8, 2, 0] 22->24 25 movie_facebook_likes <= 100.0 gini = 0.34 samples = 10 value = [1, 8, 1, 0] 24->25 32 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 24->32 26 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 25->26 27 num_voted_users <= 3340.5 gini = 0.625 samples = 4 value = [1, 2, 1, 0] 25->27 28 num_user_for_reviews <= 26.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 27->28 31 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 27->31 29 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 28->29 30 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 28->30 34 duration <= 95.0 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 33->34 37 num_user_for_reviews <= 68.5 gini = 0.506 samples = 26 value = [3, 6, 17, 0] 33->37 35 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 34->35 36 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 34->36 38 num_voted_users <= 13689.0 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 37->38 43 num_voted_users <= 13027.5 gini = 0.639 samples = 13 value = [3, 4, 6, 0] 37->43 39 duration <= 96.5 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 38->39 42 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 38->42 40 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 39->40 41 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 39->41 44 movie_facebook_likes <= 44.5 gini = 0.48 samples = 5 value = [2, 3, 0, 0] 43->44 47 duration <= 88.5 gini = 0.406 samples = 8 value = [1, 1, 6, 0] 43->47 45 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 44->45 46 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 44->46 48 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 47->48 49 num_user_for_reviews <= 79.5 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 47->49 50 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 49->50 51 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 49->51 53 duration <= 81.0 gini = 0.405 samples = 23 value = [1, 17, 5, 0] 52->53 64 duration <= 87.5 gini = 0.565 samples = 93 value = [7, 46, 40, 0] 52->64 54 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 53->54 55 duration <= 97.5 gini = 0.265 samples = 20 value = [1, 17, 2, 0] 53->55 56 num_user_for_reviews <= 58.0 gini = 0.194 samples = 19 value = [1, 17, 1, 0] 55->56 63 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 55->63 57 gini = 0.0 samples = 13 value = [0, 13, 0, 0] 56->57 58 num_user_for_reviews <= 62.5 gini = 0.5 samples = 6 value = [1, 4, 1, 0] 56->58 59 num_user_for_reviews <= 61.0 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 58->59 62 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 58->62 60 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 59->60 61 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 59->61 65 num_user_for_reviews <= 184.5 gini = 0.49 samples = 21 value = [2, 14, 5, 0] 64->65 82 num_user_for_reviews <= 70.5 gini = 0.561 samples = 72 value = [5, 32, 35, 0] 64->82 66 num_voted_users <= 7045.5 gini = 0.305 samples = 16 value = [0, 13, 3, 0] 65->66 77 num_voted_users <= 23222.0 gini = 0.64 samples = 5 value = [2, 1, 2, 0] 65->77 67 num_critic_for_reviews <= 68.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 66->67 72 num_user_for_reviews <= 94.0 gini = 0.142 samples = 13 value = [0, 12, 1, 0] 66->72 68 num_voted_users <= 3612.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 67->68 71 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 67->71 69 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 68->69 70 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 68->70 73 num_user_for_reviews <= 86.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 72->73 76 gini = 0.0 samples = 10 value = [0, 10, 0, 0] 72->76 74 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 73->74 75 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 73->75 78 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 77->78 79 duration <= 79.5 gini = 0.444 samples = 3 value = [2, 1, 0, 0] 77->79 80 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 79->80 81 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 79->81 83 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 82->83 84 num_critic_for_reviews <= 45.5 gini = 0.565 samples = 69 value = [5, 32, 32, 0] 82->84 85 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 84->85 86 num_user_for_reviews <= 71.5 gini = 0.554 samples = 68 value = [4, 32, 32, 0] 84->86 87 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 86->87 88 num_voted_users <= 5217.0 gini = 0.555 samples = 66 value = [4, 30, 32, 0] 86->88 89 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 88->89 90 num_voted_users <= 17397.0 gini = 0.555 samples = 64 value = [4, 28, 32, 0] 88->90 91 num_user_for_reviews <= 90.0 gini = 0.498 samples = 15 value = [2, 3, 10, 0] 90->91 100 duration <= 96.5 gini = 0.536 samples = 49 value = [2, 25, 22, 0] 90->100 92 num_voted_users <= 11692.5 gini = 0.64 samples = 5 value = [2, 2, 1, 0] 91->92 97 num_critic_for_reviews <= 106.0 gini = 0.18 samples = 10 value = [0, 1, 9, 0] 91->97 93 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 92->93 94 duration <= 95.0 gini = 0.444 samples = 3 value = [2, 0, 1, 0] 92->94 95 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 94->95 96 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 94->96 98 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 97->98 99 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 97->99 101 num_voted_users <= 17935.5 gini = 0.531 samples = 39 value = [2, 22, 15, 0] 100->101 134 num_voted_users <= 18213.5 gini = 0.42 samples = 10 value = [0, 3, 7, 0] 100->134 102 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 101->102 103 num_user_for_reviews <= 173.0 gini = 0.508 samples = 38 value = [1, 22, 15, 0] 101->103 104 num_voted_users <= 31200.0 gini = 0.463 samples = 33 value = [0, 21, 12, 0] 103->104 129 num_voted_users <= 34088.5 gini = 0.56 samples = 5 value = [1, 1, 3, 0] 103->129 105 num_user_for_reviews <= 81.5 gini = 0.363 samples = 21 value = [0, 16, 5, 0] 104->105 118 num_user_for_reviews <= 137.5 gini = 0.486 samples = 12 value = [0, 5, 7, 0] 104->118 106 num_voted_users <= 21861.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 105->106 109 num_user_for_reviews <= 148.5 gini = 0.278 samples = 18 value = [0, 15, 3, 0] 105->109 107 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 106->107 108 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 106->108 110 gini = 0.0 samples = 11 value = [0, 11, 0, 0] 109->110 111 duration <= 93.5 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 109->111 112 num_voted_users <= 20647.0 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 111->112 117 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 111->117 113 duration <= 92.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 112->113 116 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 112->116 114 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 113->114 115 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 113->115 119 duration <= 90.0 gini = 0.42 samples = 10 value = [0, 3, 7, 0] 118->119 128 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 118->128 120 num_user_for_reviews <= 113.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 119->120 123 num_voted_users <= 36250.0 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 119->123 121 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 120->121 122 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 120->122 124 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 123->124 125 num_user_for_reviews <= 79.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 123->125 126 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 125->126 127 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 125->127 130 num_critic_for_reviews <= 70.0 gini = 0.375 samples = 4 value = [1, 0, 3, 0] 129->130 133 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 129->133 131 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 130->131 132 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 130->132 135 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 134->135 136 num_user_for_reviews <= 159.5 gini = 0.346 samples = 9 value = [0, 2, 7, 0] 134->136 137 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 136->137 138 duration <= 97.5 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 136->138 139 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 138->139 140 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 138->140 143 num_critic_for_reviews <= 71.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 142->143 146 num_user_for_reviews <= 166.5 gini = 0.18 samples = 10 value = [0, 9, 1, 0] 142->146 144 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 143->144 145 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 143->145 147 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 146->147 148 num_voted_users <= 28201.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 146->148 149 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 148->149 150 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 148->150 152 num_critic_for_reviews <= 146.5 gini = 0.198 samples = 9 value = [0, 1, 8, 0] 151->152 155 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 151->155 153 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 152->153 154 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 152->154 157 num_critic_for_reviews <= 58.5 gini = 0.436 samples = 27 value = [0, 7, 19, 1] 156->157 174 num_user_for_reviews <= 184.5 gini = 0.58 samples = 10 value = [1, 5, 4, 0] 156->174 158 num_voted_users <= 1316.5 gini = 0.496 samples = 11 value = [0, 5, 6, 0] 157->158 167 num_voted_users <= 5889.5 gini = 0.32 samples = 16 value = [0, 2, 13, 1] 157->167 159 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 158->159 160 num_user_for_reviews <= 36.0 gini = 0.469 samples = 8 value = [0, 5, 3, 0] 158->160 161 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 160->161 162 duration <= 101.0 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 160->162 163 num_critic_for_reviews <= 45.5 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 162->163 166 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 162->166 164 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 163->164 165 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 163->165 168 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 167->168 169 num_critic_for_reviews <= 117.5 gini = 0.231 samples = 15 value = [0, 2, 13, 0] 167->169 170 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 169->170 171 num_critic_for_reviews <= 131.5 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 169->171 172 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 171->172 173 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 171->173 175 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 174->175 176 num_voted_users <= 40621.5 gini = 0.571 samples = 7 value = [1, 2, 4, 0] 174->176 177 num_voted_users <= 6412.5 gini = 0.32 samples = 5 value = [1, 0, 4, 0] 176->177 180 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 176->180 178 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 177->178 179 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 177->179 183 num_user_for_reviews <= 118.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 182->183 186 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 182->186 184 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 183->184 185 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 183->185 188 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 187->188 189 num_voted_users <= 23112.0 gini = 0.18 samples = 10 value = [0, 9, 1, 0] 187->189 190 num_critic_for_reviews <= 173.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 189->190 193 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 189->193 191 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 190->191 192 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 190->192 195 num_user_for_reviews <= 146.5 gini = 0.594 samples = 8 value = [1, 3, 4, 0] 194->195 204 num_critic_for_reviews <= 191.0 gini = 0.124 samples = 15 value = [0, 1, 14, 0] 194->204 196 num_voted_users <= 20217.0 gini = 0.375 samples = 4 value = [1, 0, 3, 0] 195->196 199 num_critic_for_reviews <= 280.0 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 195->199 197 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 196->197 198 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 196->198 200 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 199->200 201 duration <= 95.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 199->201 202 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 201->202 203 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 201->203 205 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 204->205 206 gini = 0.0 samples = 14 value = [0, 0, 14, 0] 204->206 208 num_user_for_reviews <= 19.5 gini = 0.419 samples = 108 value = [2, 29, 77, 0] 207->208 279 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 207->279 209 num_voted_users <= 1854.0 gini = 0.444 samples = 3 value = [1, 2, 0, 0] 208->209 212 num_critic_for_reviews <= 162.5 gini = 0.396 samples = 105 value = [1, 27, 77, 0] 208->212 210 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 209->210 211 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 209->211 213 num_voted_users <= 22788.0 gini = 0.422 samples = 95 value = [1, 27, 67, 0] 212->213 278 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 212->278 214 num_voted_users <= 9078.0 gini = 0.352 samples = 63 value = [1, 13, 49, 0] 213->214 253 num_voted_users <= 41943.5 gini = 0.492 samples = 32 value = [0, 14, 18, 0] 213->253 215 num_voted_users <= 8438.5 gini = 0.482 samples = 28 value = [1, 9, 18, 0] 214->215 238 num_critic_for_reviews <= 137.5 gini = 0.202 samples = 35 value = [0, 4, 31, 0] 214->238 216 duration <= 109.5 gini = 0.447 samples = 26 value = [1, 7, 18, 0] 215->216 237 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 215->237 217 num_user_for_reviews <= 84.0 gini = 0.405 samples = 23 value = [1, 5, 17, 0] 216->217 234 num_critic_for_reviews <= 94.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 216->234 218 num_critic_for_reviews <= 33.0 gini = 0.335 samples = 20 value = [1, 3, 16, 0] 217->218 231 num_user_for_reviews <= 113.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 217->231 219 duration <= 103.5 gini = 0.512 samples = 11 value = [1, 3, 7, 0] 218->219 230 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 218->230 220 num_critic_for_reviews <= 18.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 219->220 223 num_critic_for_reviews <= 30.0 gini = 0.346 samples = 9 value = [0, 2, 7, 0] 219->223 221 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 220->221 222 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 220->222 224 num_voted_users <= 6571.5 gini = 0.219 samples = 8 value = [0, 1, 7, 0] 223->224 229 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 223->229 225 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 224->225 226 duration <= 106.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 224->226 227 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 226->227 228 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 226->228 232 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 231->232 233 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 231->233 235 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 234->235 236 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 234->236 239 num_user_for_reviews <= 34.5 gini = 0.121 samples = 31 value = [0, 2, 29, 0] 238->239 248 duration <= 106.0 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 238->248 240 num_user_for_reviews <= 31.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 239->240 243 num_voted_users <= 19472.5 gini = 0.067 samples = 29 value = [0, 1, 28, 0] 239->243 241 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 240->241 242 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 240->242 244 gini = 0.0 samples = 25 value = [0, 0, 25, 0] 243->244 245 num_user_for_reviews <= 148.0 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 243->245 246 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 245->246 247 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 245->247 249 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 248->249 250 num_user_for_reviews <= 72.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 248->250 251 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 250->251 252 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 250->252 254 num_voted_users <= 24243.0 gini = 0.497 samples = 26 value = [0, 14, 12, 0] 253->254 277 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 253->277 255 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 254->255 256 num_voted_users <= 25807.5 gini = 0.499 samples = 23 value = [0, 11, 12, 0] 254->256 257 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 256->257 258 num_voted_users <= 39501.0 gini = 0.495 samples = 20 value = [0, 11, 9, 0] 256->258 259 num_voted_users <= 37690.5 gini = 0.5 samples = 18 value = [0, 9, 9, 0] 258->259 276 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 258->276 260 num_user_for_reviews <= 111.0 gini = 0.492 samples = 16 value = [0, 9, 7, 0] 259->260 275 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 259->275 261 num_user_for_reviews <= 86.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 260->261 266 num_user_for_reviews <= 152.0 gini = 0.42 samples = 10 value = [0, 7, 3, 0] 260->266 262 num_critic_for_reviews <= 97.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 261->262 265 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 261->265 263 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 262->263 264 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 262->264 267 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 266->267 268 num_user_for_reviews <= 155.0 gini = 0.5 samples = 6 value = [0, 3, 3, 0] 266->268 269 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 268->269 270 num_user_for_reviews <= 195.0 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 268->270 271 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 270->271 272 duration <= 105.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 270->272 273 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 272->273 274 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 272->274 281 num_voted_users <= 34490.5 gini = 0.62 samples = 10 value = [5, 2, 3, 0] 280->281 292 duration <= 99.5 gini = 0.411 samples = 47 value = [1, 34, 12, 0] 280->292 282 num_user_for_reviews <= 294.0 gini = 0.531 samples = 8 value = [5, 2, 1, 0] 281->282 291 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 281->291 283 num_critic_for_reviews <= 55.5 gini = 0.444 samples = 3 value = [1, 2, 0, 0] 282->283 286 num_voted_users <= 25435.5 gini = 0.32 samples = 5 value = [4, 0, 1, 0] 282->286 284 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 283->284 285 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 283->285 287 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 286->287 288 duration <= 94.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 286->288 289 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 288->289 290 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 288->290 293 num_critic_for_reviews <= 166.5 gini = 0.19 samples = 29 value = [1, 26, 2, 0] 292->293 306 num_voted_users <= 44652.5 gini = 0.494 samples = 18 value = [0, 8, 10, 0] 292->306 294 num_critic_for_reviews <= 111.5 gini = 0.095 samples = 20 value = [1, 19, 0, 0] 293->294 299 num_critic_for_reviews <= 174.0 gini = 0.346 samples = 9 value = [0, 7, 2, 0] 293->299 295 num_voted_users <= 37003.5 gini = 0.375 samples = 4 value = [1, 3, 0, 0] 294->295 298 gini = 0.0 samples = 16 value = [0, 16, 0, 0] 294->298 296 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 295->296 297 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 295->297 300 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 299->300 301 num_voted_users <= 40245.0 gini = 0.219 samples = 8 value = [0, 7, 1, 0] 299->301 302 num_voted_users <= 39337.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 301->302 305 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 301->305 303 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 302->303 304 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 302->304 307 num_voted_users <= 28100.0 gini = 0.444 samples = 15 value = [0, 5, 10, 0] 306->307 318 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 306->318 308 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 307->308 309 num_voted_users <= 39210.0 gini = 0.355 samples = 13 value = [0, 3, 10, 0] 307->309 310 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 309->310 311 num_user_for_reviews <= 348.0 gini = 0.5 samples = 6 value = [0, 3, 3, 0] 309->311 312 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 311->312 313 num_critic_for_reviews <= 132.0 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 311->313 314 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 313->314 315 num_voted_users <= 44165.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 313->315 316 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 315->316 317 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 315->317 320 num_critic_for_reviews <= 107.5 gini = 0.525 samples = 325 value = [11, 180, 133, 1] 319->320 501 num_user_for_reviews <= 436.0 gini = 0.506 samples = 108 value = [14, 71, 23, 0] 319->501 321 num_voted_users <= 22822.0 gini = 0.512 samples = 267 value = [11, 159, 97, 0] 320->321 470 num_critic_for_reviews <= 124.5 gini = 0.483 samples = 58 value = [0, 21, 36, 1] 320->470 322 movie_facebook_likes <= 218.0 gini = 0.526 samples = 239 value = [10, 134, 95, 0] 321->322 461 num_critic_for_reviews <= 19.0 gini = 0.196 samples = 28 value = [1, 25, 2, 0] 321->461 323 gini = 0.0 samples = 10 value = [0, 10, 0, 0] 322->323 324 num_voted_users <= 7273.5 gini = 0.533 samples = 229 value = [10, 124, 95, 0] 322->324 325 num_user_for_reviews <= 46.5 gini = 0.536 samples = 93 value = [4, 40, 49, 0] 324->325 378 movie_facebook_likes <= 617.5 gini = 0.502 samples = 136 value = [6, 84, 46, 0] 324->378 326 movie_facebook_likes <= 313.5 gini = 0.512 samples = 53 value = [1, 29, 23, 0] 325->326 355 num_user_for_reviews <= 59.5 gini = 0.496 samples = 40 value = [3, 11, 26, 0] 325->355 327 num_critic_for_reviews <= 34.5 gini = 0.444 samples = 15 value = [0, 5, 10, 0] 326->327 336 movie_facebook_likes <= 388.0 gini = 0.483 samples = 38 value = [1, 24, 13, 0] 326->336 328 movie_facebook_likes <= 274.0 gini = 0.494 samples = 9 value = [0, 5, 4, 0] 327->328 335 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 327->335 329 num_voted_users <= 3875.5 gini = 0.408 samples = 7 value = [0, 5, 2, 0] 328->329 334 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 328->334 330 num_critic_for_reviews <= 14.0 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 329->330 333 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 329->333 331 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 330->331 332 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 330->332 337 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 336->337 338 movie_facebook_likes <= 544.5 gini = 0.53 samples = 29 value = [1, 15, 13, 0] 336->338 339 num_critic_for_reviews <= 7.0 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 338->339 342 num_voted_users <= 5052.0 gini = 0.507 samples = 23 value = [1, 14, 8, 0] 338->342 340 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 339->340 341 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 339->341 343 duration <= 102.5 gini = 0.549 samples = 18 value = [1, 9, 8, 0] 342->343 354 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 342->354 344 movie_facebook_likes <= 568.5 gini = 0.473 samples = 13 value = [0, 5, 8, 0] 343->344 351 movie_facebook_likes <= 570.5 gini = 0.32 samples = 5 value = [1, 4, 0, 0] 343->351 345 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 344->345 346 movie_facebook_likes <= 874.5 gini = 0.397 samples = 11 value = [0, 3, 8, 0] 344->346 347 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 346->347 348 num_critic_for_reviews <= 25.0 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 346->348 349 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 348->349 350 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 348->350 352 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 351->352 353 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 351->353 356 duration <= 78.0 gini = 0.142 samples = 13 value = [1, 0, 12, 0] 355->356 361 num_voted_users <= 3951.5 gini = 0.56 samples = 27 value = [2, 11, 14, 0] 355->361 357 duration <= 75.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 356->357 360 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 356->360 358 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 357->358 359 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 357->359 362 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 361->362 363 num_critic_for_reviews <= 66.5 gini = 0.493 samples = 25 value = [0, 11, 14, 0] 361->363 364 movie_facebook_likes <= 307.0 gini = 0.495 samples = 20 value = [0, 11, 9, 0] 363->364 377 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 363->377 365 num_voted_users <= 5476.0 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 364->365 370 movie_facebook_likes <= 554.0 gini = 0.408 samples = 14 value = [0, 10, 4, 0] 364->370 366 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 365->366 367 num_critic_for_reviews <= 43.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 365->367 368 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 367->368 369 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 367->369 371 num_voted_users <= 4331.0 gini = 0.198 samples = 9 value = [0, 8, 1, 0] 370->371 374 num_user_for_reviews <= 79.5 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 370->374 372 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 371->372 373 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 371->373 375 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 374->375 376 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 374->376 379 duration <= 95.5 gini = 0.429 samples = 78 value = [4, 56, 18, 0] 378->379 422 num_user_for_reviews <= 68.5 gini = 0.533 samples = 58 value = [2, 28, 28, 0] 378->422 380 num_critic_for_reviews <= 41.5 gini = 0.345 samples = 49 value = [4, 39, 6, 0] 379->380 405 num_voted_users <= 11601.0 gini = 0.485 samples = 29 value = [0, 17, 12, 0] 379->405 381 movie_facebook_likes <= 396.5 gini = 0.594 samples = 8 value = [3, 4, 1, 0] 380->381 390 num_user_for_reviews <= 97.5 gini = 0.256 samples = 41 value = [1, 35, 5, 0] 380->390 382 movie_facebook_likes <= 324.0 gini = 0.444 samples = 3 value = [2, 0, 1, 0] 381->382 385 movie_facebook_likes <= 484.5 gini = 0.32 samples = 5 value = [1, 4, 0, 0] 381->385 383 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 382->383 384 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 382->384 386 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 385->386 387 num_user_for_reviews <= 45.5 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 385->387 388 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 387->388 389 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 387->389 391 num_user_for_reviews <= 95.0 gini = 0.381 samples = 25 value = [1, 19, 5, 0] 390->391 404 gini = 0.0 samples = 16 value = [0, 16, 0, 0] 390->404 392 num_voted_users <= 11897.0 gini = 0.344 samples = 24 value = [1, 19, 4, 0] 391->392 403 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 391->403 393 num_user_for_reviews <= 78.5 gini = 0.124 samples = 15 value = [0, 14, 1, 0] 392->393 398 num_voted_users <= 13321.5 gini = 0.568 samples = 9 value = [1, 5, 3, 0] 392->398 394 gini = 0.0 samples = 11 value = [0, 11, 0, 0] 393->394 395 num_user_for_reviews <= 85.0 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 393->395 396 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 395->396 397 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 395->397 399 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 398->399 400 duration <= 89.0 gini = 0.278 samples = 6 value = [1, 5, 0, 0] 398->400 401 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 400->401 402 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 400->402 406 movie_facebook_likes <= 459.5 gini = 0.473 samples = 13 value = [0, 5, 8, 0] 405->406 415 num_voted_users <= 16283.0 gini = 0.375 samples = 16 value = [0, 12, 4, 0] 405->415 407 duration <= 103.5 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 406->407 412 num_user_for_reviews <= 48.5 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 406->412 408 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 407->408 409 num_user_for_reviews <= 86.5 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 407->409 410 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 409->410 411 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 409->411 413 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 412->413 414 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 412->414 416 duration <= 97.0 gini = 0.245 samples = 14 value = [0, 12, 2, 0] 415->416 421 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 415->421 417 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 416->417 418 num_critic_for_reviews <= 95.0 gini = 0.142 samples = 13 value = [0, 12, 1, 0] 416->418 419 gini = 0.0 samples = 12 value = [0, 12, 0, 0] 418->419 420 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 418->420 423 num_voted_users <= 17752.5 gini = 0.403 samples = 12 value = [1, 9, 2, 0] 422->423 430 movie_facebook_likes <= 946.0 gini = 0.509 samples = 46 value = [1, 19, 26, 0] 422->430 424 duration <= 89.0 gini = 0.198 samples = 9 value = [1, 8, 0, 0] 423->424 427 num_user_for_reviews <= 63.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 423->427 425 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 424->425 426 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 424->426 428 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 427->428 429 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 427->429 431 duration <= 88.5 gini = 0.527 samples = 33 value = [1, 17, 15, 0] 430->431 454 num_critic_for_reviews <= 42.0 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 430->454 432 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 431->432 433 num_critic_for_reviews <= 57.5 gini = 0.503 samples = 28 value = [1, 17, 10, 0] 431->433 434 movie_facebook_likes <= 814.0 gini = 0.542 samples = 12 value = [1, 4, 7, 0] 433->434 445 num_user_for_reviews <= 152.5 gini = 0.305 samples = 16 value = [0, 13, 3, 0] 433->445 435 movie_facebook_likes <= 710.5 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 434->435 442 num_critic_for_reviews <= 38.5 gini = 0.32 samples = 5 value = [1, 0, 4, 0] 434->442 436 movie_facebook_likes <= 655.5 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 435->436 441 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 435->441 437 movie_facebook_likes <= 628.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 436->437 440 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 436->440 438 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 437->438 439 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 437->439 443 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 442->443 444 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 442->444 446 num_critic_for_reviews <= 103.0 gini = 0.231 samples = 15 value = [0, 13, 2, 0] 445->446 453 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 445->453 447 duration <= 103.5 gini = 0.133 samples = 14 value = [0, 13, 1, 0] 446->447 452 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 446->452 448 gini = 0.0 samples = 12 value = [0, 12, 0, 0] 447->448 449 num_critic_for_reviews <= 78.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 447->449 450 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 449->450 451 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 449->451 455 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 454->455 456 duration <= 92.5 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 454->456 457 duration <= 90.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 456->457 460 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 456->460 458 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 457->458 459 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 457->459 462 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 461->462 463 movie_facebook_likes <= 878.0 gini = 0.137 samples = 27 value = [0, 25, 2, 0] 461->463 464 gini = 0.0 samples = 15 value = [0, 15, 0, 0] 463->464 465 movie_facebook_likes <= 989.5 gini = 0.278 samples = 12 value = [0, 10, 2, 0] 463->465 466 num_voted_users <= 27614.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 465->466 469 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 465->469 467 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 466->467 468 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 466->468 471 duration <= 107.0 gini = 0.198 samples = 18 value = [0, 2, 16, 0] 470->471 480 num_voted_users <= 39302.0 gini = 0.524 samples = 40 value = [0, 19, 20, 1] 470->480 472 num_voted_users <= 30947.5 gini = 0.117 samples = 16 value = [0, 1, 15, 0] 471->472 477 num_user_for_reviews <= 80.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 471->477 473 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 472->473 474 num_critic_for_reviews <= 118.5 gini = 0.32 samples = 5 value = [0, 1, 4, 0] 472->474 475 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 474->475 476 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 474->476 478 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 477->478 479 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 477->479 481 num_voted_users <= 11313.0 gini = 0.506 samples = 32 value = [0, 19, 12, 1] 480->481 500 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 480->500 482 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 481->482 483 num_critic_for_reviews <= 162.0 gini = 0.487 samples = 30 value = [0, 19, 10, 1] 481->483 484 num_voted_users <= 27164.0 gini = 0.305 samples = 16 value = [0, 13, 3, 0] 483->484 493 num_user_for_reviews <= 125.5 gini = 0.561 samples = 14 value = [0, 6, 7, 1] 483->493 485 num_voted_users <= 25087.0 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 484->485 492 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 484->492 486 duration <= 91.0 gini = 0.32 samples = 5 value = [0, 4, 1, 0] 485->486 491 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 485->491 487 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 486->487 488 num_voted_users <= 19483.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 486->488 489 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 488->489 490 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 488->490 494 movie_facebook_likes <= 17500.0 gini = 0.278 samples = 6 value = [0, 0, 5, 1] 493->494 497 duration <= 90.0 gini = 0.375 samples = 8 value = [0, 6, 2, 0] 493->497 495 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 494->495 496 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 494->496 498 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 497->498 499 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 497->499 502 num_critic_for_reviews <= 103.0 gini = 0.484 samples = 105 value = [11, 71, 23, 0] 501->502 569 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 501->569 503 num_voted_users <= 7340.0 gini = 0.401 samples = 40 value = [7, 30, 3, 0] 502->503 532 num_user_for_reviews <= 275.0 gini = 0.504 samples = 65 value = [4, 41, 20, 0] 502->532 504 num_critic_for_reviews <= 47.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 503->504 507 num_voted_users <= 34152.5 gini = 0.349 samples = 38 value = [6, 30, 2, 0] 503->507 505 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 504->505 506 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 504->506 508 duration <= 80.5 gini = 0.275 samples = 32 value = [3, 27, 2, 0] 507->508 527 duration <= 89.0 gini = 0.5 samples = 6 value = [3, 3, 0, 0] 507->527 509 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 508->509 510 movie_facebook_likes <= 1500.0 gini = 0.233 samples = 31 value = [2, 27, 2, 0] 508->510 511 duration <= 91.5 gini = 0.19 samples = 29 value = [2, 26, 1, 0] 510->511 524 num_user_for_reviews <= 170.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 510->524 512 num_voted_users <= 8427.0 gini = 0.32 samples = 10 value = [2, 8, 0, 0] 511->512 519 movie_facebook_likes <= 848.0 gini = 0.1 samples = 19 value = [0, 18, 1, 0] 511->519 513 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 512->513 514 num_critic_for_reviews <= 73.0 gini = 0.198 samples = 9 value = [1, 8, 0, 0] 512->514 515 num_voted_users <= 12734.5 gini = 0.444 samples = 3 value = [1, 2, 0, 0] 514->515 518 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 514->518 516 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 515->516 517 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 515->517 520 gini = 0.0 samples = 12 value = [0, 12, 0, 0] 519->520 521 movie_facebook_likes <= 894.5 gini = 0.245 samples = 7 value = [0, 6, 1, 0] 519->521 522 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 521->522 523 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 521->523 525 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 524->525 526 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 524->526 528 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 527->528 529 movie_facebook_likes <= 612.5 gini = 0.375 samples = 4 value = [3, 1, 0, 0] 527->529 530 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 529->530 531 gini = 0.0 samples = 3 value = [3, 0, 0, 0] 529->531 533 num_voted_users <= 44517.0 gini = 0.529 samples = 45 value = [2, 25, 18, 0] 532->533 562 num_critic_for_reviews <= 148.0 gini = 0.34 samples = 20 value = [2, 16, 2, 0] 532->562 534 num_voted_users <= 40148.0 gini = 0.547 samples = 39 value = [2, 19, 18, 0] 533->534 561 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 533->561 535 duration <= 107.5 gini = 0.538 samples = 34 value = [2, 19, 13, 0] 534->535 560 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 534->560 536 duration <= 99.5 gini = 0.516 samples = 31 value = [2, 19, 10, 0] 535->536 559 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 535->559 537 duration <= 92.5 gini = 0.595 samples = 17 value = [2, 8, 7, 0] 536->537 550 duration <= 102.5 gini = 0.337 samples = 14 value = [0, 11, 3, 0] 536->550 538 num_voted_users <= 36313.0 gini = 0.531 samples = 8 value = [2, 5, 1, 0] 537->538 545 num_user_for_reviews <= 196.0 gini = 0.444 samples = 9 value = [0, 3, 6, 0] 537->545 539 num_user_for_reviews <= 257.5 gini = 0.278 samples = 6 value = [1, 5, 0, 0] 538->539 542 num_voted_users <= 38295.0 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 538->542 540 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 539->540 541 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 539->541 543 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 542->543 544 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 542->544 546 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 545->546 547 num_user_for_reviews <= 232.5 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 545->547 548 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 547->548 549 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 547->549 551 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 550->551 552 duration <= 104.5 gini = 0.469 samples = 8 value = [0, 5, 3, 0] 550->552 553 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 552->553 554 num_user_for_reviews <= 193.0 gini = 0.278 samples = 6 value = [0, 5, 1, 0] 552->554 555 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 554->555 556 movie_facebook_likes <= 900.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 554->556 557 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 556->557 558 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 556->558 563 gini = 0.0 samples = 12 value = [0, 12, 0, 0] 562->563 564 num_voted_users <= 45410.5 gini = 0.625 samples = 8 value = [2, 4, 2, 0] 562->564 565 duration <= 95.0 gini = 0.444 samples = 6 value = [2, 4, 0, 0] 564->565 568 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 564->568 566 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 565->566 567 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 565->567 571 num_user_for_reviews <= 74.0 gini = 0.391 samples = 152 value = [1, 39, 112, 0] 570->571 656 num_user_for_reviews <= 213.5 gini = 0.515 samples = 214 value = [4, 91, 118, 1] 570->656 572 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 571->572 573 num_critic_for_reviews <= 191.5 gini = 0.382 samples = 150 value = [1, 37, 112, 0] 571->573 574 num_voted_users <= 49673.5 gini = 0.428 samples = 109 value = [1, 32, 76, 0] 573->574 641 duration <= 106.5 gini = 0.214 samples = 41 value = [0, 5, 36, 0] 573->641 575 num_critic_for_reviews <= 64.5 gini = 0.494 samples = 9 value = [0, 5, 4, 0] 574->575 582 num_critic_for_reviews <= 183.0 gini = 0.409 samples = 100 value = [1, 27, 72, 0] 574->582 576 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 575->576 577 movie_facebook_likes <= 1443.0 gini = 0.408 samples = 7 value = [0, 5, 2, 0] 575->577 578 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 577->578 579 duration <= 107.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 577->579 580 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 579->580 581 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 579->581 583 num_voted_users <= 94303.5 gini = 0.393 samples = 95 value = [1, 24, 70, 0] 582->583 636 num_user_for_reviews <= 134.0 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 582->636 584 num_voted_users <= 85446.5 gini = 0.41 samples = 89 value = [1, 24, 64, 0] 583->584 635 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 583->635 585 num_voted_users <= 69526.0 gini = 0.368 samples = 80 value = [1, 18, 61, 0] 584->585 626 movie_facebook_likes <= 15500.0 gini = 0.444 samples = 9 value = [0, 6, 3, 0] 584->626 586 num_voted_users <= 68056.5 gini = 0.427 samples = 54 value = [1, 15, 38, 0] 585->586 617 num_user_for_reviews <= 186.5 gini = 0.204 samples = 26 value = [0, 3, 23, 0] 585->617 587 movie_facebook_likes <= 334.5 gini = 0.389 samples = 51 value = [1, 12, 38, 0] 586->587 616 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 586->616 588 num_critic_for_reviews <= 39.0 gini = 0.238 samples = 29 value = [0, 4, 25, 0] 587->588 601 duration <= 101.5 gini = 0.517 samples = 22 value = [1, 8, 13, 0] 587->601 589 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 588->589 590 num_voted_users <= 60741.0 gini = 0.191 samples = 28 value = [0, 3, 25, 0] 588->590 591 duration <= 92.5 gini = 0.087 samples = 22 value = [0, 1, 21, 0] 590->591 596 num_user_for_reviews <= 115.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 590->596 592 duration <= 91.5 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 591->592 595 gini = 0.0 samples = 15 value = [0, 0, 15, 0] 591->595 593 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 592->593 594 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 592->594 597 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 596->597 598 duration <= 100.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 596->598 599 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 598->599 600 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 598->600 602 num_critic_for_reviews <= 117.5 gini = 0.379 samples = 13 value = [1, 2, 10, 0] 601->602 611 num_voted_users <= 56935.5 gini = 0.444 samples = 9 value = [0, 6, 3, 0] 601->611 603 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 602->603 604 num_critic_for_reviews <= 156.5 gini = 0.571 samples = 7 value = [1, 2, 4, 0] 602->604 605 num_voted_users <= 54215.5 gini = 0.625 samples = 4 value = [1, 2, 1, 0] 604->605 610 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 604->610 606 num_voted_users <= 52173.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 605->606 609 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 605->609 607 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 606->607 608 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 606->608 612 duration <= 102.5 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 611->612 615 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 611->615 613 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 612->613 614 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 612->614 618 num_user_for_reviews <= 130.5 gini = 0.147 samples = 25 value = [0, 2, 23, 0] 617->618 625 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 617->625 619 num_user_for_reviews <= 118.0 gini = 0.32 samples = 10 value = [0, 2, 8, 0] 618->619 624 gini = 0.0 samples = 15 value = [0, 0, 15, 0] 618->624 620 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 619->620 621 num_critic_for_reviews <= 166.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 619->621 622 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 621->622 623 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 621->623 627 num_user_for_reviews <= 185.0 gini = 0.375 samples = 8 value = [0, 6, 2, 0] 626->627 634 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 626->634 628 num_critic_for_reviews <= 51.0 gini = 0.245 samples = 7 value = [0, 6, 1, 0] 627->628 633 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 627->633 629 num_user_for_reviews <= 112.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 628->629 632 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 628->632 630 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 629->630 631 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 629->631 637 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 636->637 638 movie_facebook_likes <= 8000.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 636->638 639 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 638->639 640 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 638->640 642 num_critic_for_reviews <= 258.0 gini = 0.153 samples = 36 value = [0, 3, 33, 0] 641->642 651 num_user_for_reviews <= 160.0 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 641->651 643 num_critic_for_reviews <= 223.0 gini = 0.071 samples = 27 value = [0, 1, 26, 0] 642->643 648 movie_facebook_likes <= 16000.0 gini = 0.346 samples = 9 value = [0, 2, 7, 0] 642->648 644 gini = 0.0 samples = 17 value = [0, 0, 17, 0] 643->644 645 num_critic_for_reviews <= 225.0 gini = 0.18 samples = 10 value = [0, 1, 9, 0] 643->645 646 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 645->646 647 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 645->647 649 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 648->649 650 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 648->650 652 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 651->652 653 num_user_for_reviews <= 178.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 651->653 654 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 653->654 655 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 653->655 657 duration <= 87.5 gini = 0.36 samples = 17 value = [0, 13, 4, 0] 656->657 668 num_user_for_reviews <= 223.5 gini = 0.508 samples = 197 value = [4, 78, 114, 1] 656->668 658 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 657->658 659 num_voted_users <= 104987.0 gini = 0.305 samples = 16 value = [0, 13, 3, 0] 657->659 660 num_voted_users <= 59184.0 gini = 0.231 samples = 15 value = [0, 13, 2, 0] 659->660 667 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 659->667 661 duration <= 91.0 gini = 0.444 samples = 6 value = [0, 4, 2, 0] 660->661 666 gini = 0.0 samples = 9 value = [0, 9, 0, 0] 660->666 662 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 661->662 663 movie_facebook_likes <= 7000.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 661->663 664 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 663->664 665 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 663->665 669 gini = 0.0 samples = 13 value = [0, 0, 13, 0] 668->669 670 num_critic_for_reviews <= 278.5 gini = 0.518 samples = 184 value = [4, 78, 101, 1] 668->670 671 num_critic_for_reviews <= 197.5 gini = 0.531 samples = 150 value = [4, 70, 75, 1] 670->671 762 num_user_for_reviews <= 327.5 gini = 0.36 samples = 34 value = [0, 8, 26, 0] 670->762 672 duration <= 96.5 gini = 0.518 samples = 104 value = [3, 40, 60, 1] 671->672 737 duration <= 85.5 gini = 0.468 samples = 46 value = [1, 30, 15, 0] 671->737 673 num_voted_users <= 83655.5 gini = 0.543 samples = 38 value = [2, 20, 16, 0] 672->673 702 num_voted_users <= 98711.5 gini = 0.463 samples = 66 value = [1, 20, 44, 1] 672->702 674 num_critic_for_reviews <= 146.0 gini = 0.458 samples = 25 value = [1, 17, 7, 0] 673->674 693 num_voted_users <= 103404.5 gini = 0.462 samples = 13 value = [1, 3, 9, 0] 673->693 675 num_user_for_reviews <= 310.5 gini = 0.551 samples = 15 value = [1, 8, 6, 0] 674->675 688 num_voted_users <= 59822.5 gini = 0.18 samples = 10 value = [0, 9, 1, 0] 674->688 676 duration <= 95.5 gini = 0.346 samples = 9 value = [0, 7, 2, 0] 675->676 683 num_voted_users <= 73792.0 gini = 0.5 samples = 6 value = [1, 1, 4, 0] 675->683 677 num_critic_for_reviews <= 94.5 gini = 0.219 samples = 8 value = [0, 7, 1, 0] 676->677 682 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 676->682 678 num_critic_for_reviews <= 89.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 677->678 681 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 677->681 679 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 678->679 680 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 678->680 684 num_critic_for_reviews <= 101.0 gini = 0.32 samples = 5 value = [0, 1, 4, 0] 683->684 687 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 683->687 685 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 684->685 686 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 684->686 689 num_critic_for_reviews <= 168.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 688->689 692 gini = 0.0 samples = 7 value = [0, 7, 0, 0] 688->692 690 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 689->690 691 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 689->691 694 num_user_for_reviews <= 392.0 gini = 0.314 samples = 11 value = [1, 1, 9, 0] 693->694 701 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 693->701 695 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 694->695 696 num_user_for_reviews <= 711.0 gini = 0.625 samples = 4 value = [1, 1, 2, 0] 694->696 697 movie_facebook_likes <= 500.0 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 696->697 700 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 696->700 698 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 697->698 699 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 697->699 703 num_user_for_reviews <= 245.0 gini = 0.415 samples = 55 value = [1, 13, 40, 1] 702->703 730 num_user_for_reviews <= 484.0 gini = 0.463 samples = 11 value = [0, 7, 4, 0] 702->730 704 movie_facebook_likes <= 1000.0 gini = 0.32 samples = 5 value = [0, 4, 1, 0] 703->704 707 num_critic_for_reviews <= 80.0 gini = 0.358 samples = 50 value = [1, 9, 39, 1] 703->707 705 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 704->705 706 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 704->706 708 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 707->708 709 num_voted_users <= 71730.5 gini = 0.339 samples = 49 value = [1, 8, 39, 1] 707->709 710 num_voted_users <= 60699.5 gini = 0.486 samples = 27 value = [1, 7, 18, 1] 709->710 725 duration <= 106.5 gini = 0.087 samples = 22 value = [0, 1, 21, 0] 709->725 711 num_user_for_reviews <= 499.5 gini = 0.277 samples = 19 value = [1, 2, 16, 0] 710->711 718 num_user_for_reviews <= 327.5 gini = 0.531 samples = 8 value = [0, 5, 2, 1] 710->718 712 num_critic_for_reviews <= 151.0 gini = 0.198 samples = 18 value = [0, 2, 16, 0] 711->712 717 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 711->717 713 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 712->713 714 num_user_for_reviews <= 320.5 gini = 0.408 samples = 7 value = [0, 2, 5, 0] 712->714 715 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 714->715 716 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 714->716 719 duration <= 104.0 gini = 0.625 samples = 4 value = [0, 1, 2, 1] 718->719 724 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 718->724 720 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 719->720 721 num_user_for_reviews <= 260.0 gini = 0.5 samples = 2 value = [0, 1, 0, 1] 719->721 722 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 721->722 723 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 721->723 726 gini = 0.0 samples = 15 value = [0, 0, 15, 0] 725->726 727 num_critic_for_reviews <= 153.0 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 725->727 728 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 727->728 729 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 727->729 731 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 730->731 732 duration <= 103.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 730->732 733 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 732->733 734 num_user_for_reviews <= 543.5 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 732->734 735 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 734->735 736 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 734->736 738 num_user_for_reviews <= 298.5 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 737->738 743 duration <= 108.5 gini = 0.411 samples = 40 value = [1, 29, 10, 0] 737->743 739 duration <= 83.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 738->739 742 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 738->742 740 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 739->740 741 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 739->741 744 num_user_for_reviews <= 289.5 gini = 0.337 samples = 34 value = [1, 27, 6, 0] 743->744 759 num_user_for_reviews <= 317.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 743->759 745 movie_facebook_likes <= 9000.0 gini = 0.496 samples = 11 value = [0, 6, 5, 0] 744->745 750 duration <= 88.5 gini = 0.163 samples = 23 value = [1, 21, 1, 0] 744->750 746 num_critic_for_reviews <= 202.0 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 745->746 749 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 745->749 747 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 746->747 748 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 746->748 751 num_critic_for_reviews <= 214.0 gini = 0.5 samples = 2 value = [1, 1, 0, 0] 750->751 754 num_voted_users <= 65950.5 gini = 0.091 samples = 21 value = [0, 20, 1, 0] 750->754 752 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 751->752 753 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 751->753 755 num_critic_for_reviews <= 242.0 gini = 0.375 samples = 4 value = [0, 3, 1, 0] 754->755 758 gini = 0.0 samples = 17 value = [0, 17, 0, 0] 754->758 756 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 755->756 757 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 755->757 760 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 759->760 761 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 759->761 763 movie_facebook_likes <= 6500.0 gini = 0.095 samples = 20 value = [0, 1, 19, 0] 762->763 768 movie_facebook_likes <= 41000.0 gini = 0.5 samples = 14 value = [0, 7, 7, 0] 762->768 764 num_voted_users <= 92992.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 763->764 767 gini = 0.0 samples = 17 value = [0, 0, 17, 0] 763->767 765 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 764->765 766 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 764->766 769 num_critic_for_reviews <= 347.5 gini = 0.463 samples = 11 value = [0, 7, 4, 0] 768->769 776 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 768->776 770 num_critic_for_reviews <= 309.0 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 769->770 775 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 769->775 771 duration <= 95.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 770->771 774 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 770->774 772 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 771->772 773 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 771->773 778 num_voted_users <= 3161.0 gini = 0.338 samples = 610 value = [3, 122, 481, 4] 777->778 1043 duration <= 115.5 gini = 0.32 samples = 5 value = [1, 4, 0, 0] 777->1043 779 duration <= 124.5 gini = 0.497 samples = 26 value = [0, 12, 14, 0] 778->779 798 num_user_for_reviews <= 623.5 gini = 0.325 samples = 584 value = [3, 110, 467, 4] 778->798 780 duration <= 119.5 gini = 0.488 samples = 19 value = [0, 11, 8, 0] 779->780 793 num_voted_users <= 2295.5 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 779->793 781 num_critic_for_reviews <= 10.5 gini = 0.498 samples = 15 value = [0, 7, 8, 0] 780->781 792 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 780->792 782 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 781->782 783 num_critic_for_reviews <= 17.5 gini = 0.473 samples = 13 value = [0, 5, 8, 0] 781->783 784 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 783->784 785 num_critic_for_reviews <= 25.5 gini = 0.469 samples = 8 value = [0, 5, 3, 0] 783->785 786 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 785->786 787 movie_facebook_likes <= 164.0 gini = 0.48 samples = 5 value = [0, 2, 3, 0] 785->787 788 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 787->788 789 duration <= 116.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 787->789 790 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 789->790 791 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 789->791 794 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 793->794 795 num_critic_for_reviews <= 49.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 793->795 796 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 795->796 797 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 795->797 799 num_voted_users <= 63585.0 gini = 0.316 samples = 570 value = [3, 103, 460, 4] 798->799 1032 duration <= 125.5 gini = 0.5 samples = 14 value = [0, 7, 7, 0] 798->1032 800 num_user_for_reviews <= 350.5 gini = 0.354 samples = 411 value = [3, 86, 319, 3] 799->800 993 num_critic_for_reviews <= 148.5 gini = 0.202 samples = 159 value = [0, 17, 141, 1] 799->993 801 duration <= 124.5 gini = 0.33 samples = 380 value = [1, 74, 302, 3] 800->801 972 num_user_for_reviews <= 492.5 gini = 0.545 samples = 31 value = [2, 12, 17, 0] 800->972 802 num_user_for_reviews <= 68.5 gini = 0.375 samples = 232 value = [0, 58, 174, 0] 801->802 917 num_user_for_reviews <= 173.5 gini = 0.24 samples = 148 value = [1, 16, 128, 3] 801->917 803 duration <= 112.5 gini = 0.49 samples = 35 value = [0, 15, 20, 0] 802->803 824 num_voted_users <= 15291.0 gini = 0.341 samples = 197 value = [0, 43, 154, 0] 802->824 804 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 803->804 805 movie_facebook_likes <= 774.5 gini = 0.499 samples = 29 value = [0, 15, 14, 0] 803->805 806 num_voted_users <= 4530.5 gini = 0.463 samples = 22 value = [0, 14, 8, 0] 805->806 819 num_user_for_reviews <= 58.0 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 805->819 807 num_user_for_reviews <= 57.0 gini = 0.408 samples = 7 value = [0, 2, 5, 0] 806->807 810 num_voted_users <= 21284.0 gini = 0.32 samples = 15 value = [0, 12, 3, 0] 806->810 808 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 807->808 809 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 807->809 811 num_critic_for_reviews <= 56.5 gini = 0.245 samples = 14 value = [0, 12, 2, 0] 810->811 818 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 810->818 812 num_critic_for_reviews <= 34.0 gini = 0.444 samples = 6 value = [0, 4, 2, 0] 811->812 817 gini = 0.0 samples = 8 value = [0, 8, 0, 0] 811->817 813 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 812->813 814 duration <= 119.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 812->814 815 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 814->815 816 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 814->816 820 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 819->820 821 duration <= 118.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 819->821 822 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 821->822 823 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 821->823 825 movie_facebook_likes <= 985.0 gini = 0.176 samples = 41 value = [0, 4, 37, 0] 824->825 844 num_voted_users <= 40041.0 gini = 0.375 samples = 156 value = [0, 39, 117, 0] 824->844 826 movie_facebook_likes <= 90.0 gini = 0.145 samples = 38 value = [0, 3, 35, 0] 825->826 841 duration <= 115.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 825->841 827 num_voted_users <= 11026.5 gini = 0.236 samples = 22 value = [0, 3, 19, 0] 826->827 840 gini = 0.0 samples = 16 value = [0, 0, 16, 0] 826->840 828 num_voted_users <= 10517.0 gini = 0.408 samples = 7 value = [0, 2, 5, 0] 827->828 835 duration <= 122.5 gini = 0.124 samples = 15 value = [0, 1, 14, 0] 827->835 829 duration <= 119.0 gini = 0.278 samples = 6 value = [0, 1, 5, 0] 828->829 834 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 828->834 830 num_critic_for_reviews <= 114.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 829->830 833 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 829->833 831 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 830->831 832 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 830->832 836 gini = 0.0 samples = 13 value = [0, 0, 13, 0] 835->836 837 num_voted_users <= 14192.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 835->837 838 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 837->838 839 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 837->839 842 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 841->842 843 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 841->843 845 duration <= 113.5 gini = 0.444 samples = 87 value = [0, 29, 58, 0] 844->845 894 num_user_for_reviews <= 95.0 gini = 0.248 samples = 69 value = [0, 10, 59, 0] 844->894 846 num_critic_for_reviews <= 39.5 gini = 0.219 samples = 24 value = [0, 3, 21, 0] 845->846 855 num_critic_for_reviews <= 227.0 gini = 0.485 samples = 63 value = [0, 26, 37, 0] 845->855 847 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 846->847 848 num_voted_users <= 39049.0 gini = 0.159 samples = 23 value = [0, 2, 21, 0] 846->848 849 num_voted_users <= 16013.5 gini = 0.087 samples = 22 value = [0, 1, 21, 0] 848->849 854 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 848->854 850 movie_facebook_likes <= 424.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 849->850 853 gini = 0.0 samples = 20 value = [0, 0, 20, 0] 849->853 851 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 850->851 852 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 850->852 856 movie_facebook_likes <= 15000.0 gini = 0.495 samples = 58 value = [0, 26, 32, 0] 855->856 893 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 855->893 857 num_critic_for_reviews <= 102.0 gini = 0.487 samples = 55 value = [0, 23, 32, 0] 856->857 892 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 856->892 858 num_voted_users <= 37024.5 gini = 0.491 samples = 23 value = [0, 13, 10, 0] 857->858 875 num_critic_for_reviews <= 122.5 gini = 0.43 samples = 32 value = [0, 10, 22, 0] 857->875 859 num_voted_users <= 22571.0 gini = 0.472 samples = 21 value = [0, 13, 8, 0] 858->859 874 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 858->874 860 num_voted_users <= 21603.5 gini = 0.48 samples = 10 value = [0, 4, 6, 0] 859->860 869 num_voted_users <= 29129.5 gini = 0.298 samples = 11 value = [0, 9, 2, 0] 859->869 861 num_user_for_reviews <= 204.5 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 860->861 868 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 860->868 862 num_critic_for_reviews <= 73.0 gini = 0.32 samples = 5 value = [0, 4, 1, 0] 861->862 867 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 861->867 863 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 862->863 864 num_critic_for_reviews <= 87.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 862->864 865 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 864->865 866 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 864->866 870 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 869->870 871 num_critic_for_reviews <= 77.5 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 869->871 872 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 871->872 873 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 871->873 876 num_critic_for_reviews <= 110.5 gini = 0.142 samples = 13 value = [0, 1, 12, 0] 875->876 881 num_voted_users <= 25884.0 gini = 0.499 samples = 19 value = [0, 9, 10, 0] 875->881 877 num_voted_users <= 24324.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 876->877 880 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 876->880 878 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 877->878 879 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 877->879 882 duration <= 116.0 gini = 0.245 samples = 7 value = [0, 1, 6, 0] 881->882 885 duration <= 117.0 gini = 0.444 samples = 12 value = [0, 8, 4, 0] 881->885 883 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 882->883 884 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 882->884 886 movie_facebook_likes <= 719.5 gini = 0.5 samples = 8 value = [0, 4, 4, 0] 885->886 891 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 885->891 887 num_voted_users <= 28506.5 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 886->887 890 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 886->890 888 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 887->888 889 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 887->889 895 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 894->895 896 num_user_for_reviews <= 171.5 gini = 0.23 samples = 68 value = [0, 9, 59, 0] 894->896 897 num_user_for_reviews <= 143.5 gini = 0.358 samples = 30 value = [0, 7, 23, 0] 896->897 908 num_critic_for_reviews <= 151.0 gini = 0.1 samples = 38 value = [0, 2, 36, 0] 896->908 898 gini = 0.0 samples = 18 value = [0, 0, 18, 0] 897->898 899 num_voted_users <= 43293.5 gini = 0.486 samples = 12 value = [0, 7, 5, 0] 897->899 900 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 899->900 901 duration <= 113.5 gini = 0.42 samples = 10 value = [0, 7, 3, 0] 899->901 902 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 901->902 903 num_critic_for_reviews <= 123.5 gini = 0.5 samples = 6 value = [0, 3, 3, 0] 901->903 904 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 903->904 905 num_voted_users <= 45095.5 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 903->905 906 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 905->906 907 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 905->907 909 gini = 0.0 samples = 27 value = [0, 0, 27, 0] 908->909 910 num_critic_for_reviews <= 163.5 gini = 0.298 samples = 11 value = [0, 2, 9, 0] 908->910 911 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 910->911 912 duration <= 112.5 gini = 0.18 samples = 10 value = [0, 1, 9, 0] 910->912 913 num_critic_for_reviews <= 188.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 912->913 916 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 912->916 914 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 913->914 915 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 913->915 918 duration <= 213.5 gini = 0.126 samples = 90 value = [0, 4, 84, 2] 917->918 939 num_voted_users <= 17617.0 gini = 0.381 samples = 58 value = [1, 12, 44, 1] 917->939 919 duration <= 128.5 gini = 0.109 samples = 87 value = [0, 4, 82, 1] 918->919 936 duration <= 220.0 gini = 0.444 samples = 3 value = [0, 0, 2, 1] 918->936 920 num_voted_users <= 55183.5 gini = 0.266 samples = 19 value = [0, 3, 16, 0] 919->920 927 duration <= 167.0 gini = 0.058 samples = 68 value = [0, 1, 66, 1] 919->927 921 duration <= 127.5 gini = 0.198 samples = 18 value = [0, 2, 16, 0] 920->921 926 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 920->926 922 gini = 0.0 samples = 14 value = [0, 0, 14, 0] 921->922 923 movie_facebook_likes <= 340.0 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 921->923 924 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 923->924 925 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 923->925 928 num_critic_for_reviews <= 38.0 gini = 0.033 samples = 60 value = [0, 1, 59, 0] 927->928 933 num_user_for_reviews <= 92.0 gini = 0.219 samples = 8 value = [0, 0, 7, 1] 927->933 929 num_critic_for_reviews <= 33.0 gini = 0.18 samples = 10 value = [0, 1, 9, 0] 928->929 932 gini = 0.0 samples = 50 value = [0, 0, 50, 0] 928->932 930 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 929->930 931 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 929->931 934 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 933->934 935 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 933->935 937 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 936->937 938 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 936->938 940 duration <= 147.5 gini = 0.494 samples = 9 value = [0, 5, 4, 0] 939->940 945 movie_facebook_likes <= 23000.0 gini = 0.312 samples = 49 value = [1, 7, 40, 1] 939->945 941 movie_facebook_likes <= 320.0 gini = 0.32 samples = 5 value = [0, 1, 4, 0] 940->941 944 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 940->944 942 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 941->942 943 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 941->943 946 duration <= 134.5 gini = 0.268 samples = 46 value = [1, 5, 39, 1] 945->946 969 num_user_for_reviews <= 246.0 gini = 0.444 samples = 3 value = [0, 2, 1, 0] 945->969 947 duration <= 132.5 gini = 0.392 samples = 24 value = [1, 5, 18, 0] 946->947 964 num_voted_users <= 32148.0 gini = 0.087 samples = 22 value = [0, 0, 21, 1] 946->964 948 num_critic_for_reviews <= 111.0 gini = 0.255 samples = 20 value = [0, 3, 17, 0] 947->948 959 num_voted_users <= 36276.5 gini = 0.625 samples = 4 value = [1, 2, 1, 0] 947->959 949 num_critic_for_reviews <= 94.5 gini = 0.397 samples = 11 value = [0, 3, 8, 0] 948->949 958 gini = 0.0 samples = 9 value = [0, 0, 9, 0] 948->958 950 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 949->950 951 num_user_for_reviews <= 177.5 gini = 0.5 samples = 6 value = [0, 3, 3, 0] 949->951 952 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 951->952 953 duration <= 126.5 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 951->953 954 num_voted_users <= 39032.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 953->954 957 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 953->957 955 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 954->955 956 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 954->956 960 num_critic_for_reviews <= 171.5 gini = 0.5 samples = 2 value = [1, 0, 1, 0] 959->960 963 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 959->963 961 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 960->961 962 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 960->962 965 num_voted_users <= 31511.5 gini = 0.198 samples = 9 value = [0, 0, 8, 1] 964->965 968 gini = 0.0 samples = 13 value = [0, 0, 13, 0] 964->968 966 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 965->966 967 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 965->967 970 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 969->970 971 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 969->971 973 num_voted_users <= 56534.0 gini = 0.568 samples = 26 value = [2, 12, 12, 0] 972->973 992 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 972->992 974 num_user_for_reviews <= 433.5 gini = 0.57 samples = 16 value = [2, 9, 5, 0] 973->974 985 num_user_for_reviews <= 405.0 gini = 0.42 samples = 10 value = [0, 3, 7, 0] 973->985 975 num_critic_for_reviews <= 154.0 gini = 0.628 samples = 11 value = [2, 4, 5, 0] 974->975 984 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 974->984 976 num_voted_users <= 45243.0 gini = 0.625 samples = 8 value = [2, 4, 2, 0] 975->976 983 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 975->983 977 num_critic_for_reviews <= 129.5 gini = 0.64 samples = 5 value = [2, 1, 2, 0] 976->977 982 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 976->982 978 num_user_for_reviews <= 390.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 977->978 981 gini = 0.0 samples = 2 value = [2, 0, 0, 0] 977->981 979 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 978->979 980 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 978->980 986 duration <= 122.5 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 985->986 991 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 985->991 987 num_critic_for_reviews <= 185.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 986->987 990 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 986->990 988 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 987->988 989 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 987->989 994 num_voted_users <= 74324.0 gini = 0.051 samples = 77 value = [0, 2, 75, 0] 993->994 1003 num_critic_for_reviews <= 188.5 gini = 0.319 samples = 82 value = [0, 15, 66, 1] 993->1003 995 num_voted_users <= 74141.5 gini = 0.147 samples = 25 value = [0, 2, 23, 0] 994->995 1002 gini = 0.0 samples = 52 value = [0, 0, 52, 0] 994->1002 996 num_critic_for_reviews <= 130.5 gini = 0.08 samples = 24 value = [0, 1, 23, 0] 995->996 1001 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 995->1001 997 gini = 0.0 samples = 20 value = [0, 0, 20, 0] 996->997 998 num_critic_for_reviews <= 138.5 gini = 0.375 samples = 4 value = [0, 1, 3, 0] 996->998 999 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 998->999 1000 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 998->1000 1004 num_voted_users <= 94613.5 gini = 0.497 samples = 24 value = [0, 8, 15, 1] 1003->1004 1015 num_user_for_reviews <= 425.5 gini = 0.212 samples = 58 value = [0, 7, 51, 0] 1003->1015 1005 num_voted_users <= 77840.5 gini = 0.431 samples = 21 value = [0, 5, 15, 1] 1004->1005 1014 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 1004->1014 1006 num_user_for_reviews <= 206.0 gini = 0.579 samples = 11 value = [0, 5, 5, 1] 1005->1006 1013 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 1005->1013 1007 num_voted_users <= 66638.0 gini = 0.375 samples = 4 value = [0, 3, 0, 1] 1006->1007 1010 duration <= 123.5 gini = 0.408 samples = 7 value = [0, 2, 5, 0] 1006->1010 1008 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1007->1008 1009 gini = 0.0 samples = 3 value = [0, 3, 0, 0] 1007->1009 1011 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 1010->1011 1012 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1010->1012 1016 duration <= 111.5 gini = 0.113 samples = 50 value = [0, 3, 47, 0] 1015->1016 1027 duration <= 117.0 gini = 0.5 samples = 8 value = [0, 4, 4, 0] 1015->1027 1017 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1016->1017 1018 movie_facebook_likes <= 42000.0 gini = 0.078 samples = 49 value = [0, 2, 47, 0] 1016->1018 1019 num_user_for_reviews <= 201.0 gini = 0.043 samples = 46 value = [0, 1, 45, 0] 1018->1019 1024 num_critic_for_reviews <= 281.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 1018->1024 1020 num_user_for_reviews <= 196.0 gini = 0.165 samples = 11 value = [0, 1, 10, 0] 1019->1020 1023 gini = 0.0 samples = 35 value = [0, 0, 35, 0] 1019->1023 1021 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 1020->1021 1022 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1020->1022 1025 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1024->1025 1026 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1024->1026 1028 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1027->1028 1029 num_user_for_reviews <= 591.5 gini = 0.32 samples = 5 value = [0, 4, 1, 0] 1027->1029 1030 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 1029->1030 1031 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1029->1031 1033 num_user_for_reviews <= 821.0 gini = 0.444 samples = 9 value = [0, 6, 3, 0] 1032->1033 1040 movie_facebook_likes <= 3500.0 gini = 0.32 samples = 5 value = [0, 1, 4, 0] 1032->1040 1034 num_voted_users <= 101978.5 gini = 0.278 samples = 6 value = [0, 5, 1, 0] 1033->1034 1037 num_user_for_reviews <= 895.5 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 1033->1037 1035 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 1034->1035 1036 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1034->1036 1038 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1037->1038 1039 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1037->1039 1041 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 1040->1041 1042 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1040->1042 1044 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 1043->1044 1045 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 1043->1045 1047 duration <= 221.0 gini = 0.253 samples = 687 value = [1, 50, 590, 46] 1046->1047 1244 num_critic_for_reviews <= 286.5 gini = 0.296 samples = 61 value = [0, 0, 11, 50] 1046->1244 1048 num_voted_users <= 405996.5 gini = 0.243 samples = 681 value = [1, 50, 589, 41] 1047->1048 1241 num_user_for_reviews <= 404.0 gini = 0.278 samples = 6 value = [0, 0, 1, 5] 1047->1241 1049 movie_facebook_likes <= 71500.0 gini = 0.218 samples = 624 value = [1, 50, 549, 24] 1048->1049 1214 num_critic_for_reviews <= 253.0 gini = 0.419 samples = 57 value = [0, 0, 40, 17] 1048->1214 1050 num_user_for_reviews <= 1217.0 gini = 0.207 samples = 599 value = [1, 49, 531, 18] 1049->1050 1203 movie_facebook_likes <= 81000.0 gini = 0.422 samples = 25 value = [0, 1, 18, 6] 1049->1203 1051 num_voted_users <= 192187.0 gini = 0.194 samples = 557 value = [1, 40, 498, 18] 1050->1051 1190 num_user_for_reviews <= 1446.5 gini = 0.337 samples = 42 value = [0, 9, 33, 0] 1050->1190 1052 movie_facebook_likes <= 61000.0 gini = 0.223 samples = 336 value = [1, 35, 294, 6] 1051->1052 1151 duration <= 177.0 gini = 0.144 samples = 221 value = [0, 5, 204, 12] 1051->1151 1053 duration <= 127.5 gini = 0.208 samples = 329 value = [1, 31, 291, 6] 1052->1053 1148 movie_facebook_likes <= 66500.0 gini = 0.49 samples = 7 value = [0, 4, 3, 0] 1052->1148 1054 num_user_for_reviews <= 244.0 gini = 0.249 samples = 261 value = [1, 31, 224, 5] 1053->1054 1143 num_user_for_reviews <= 234.0 gini = 0.029 samples = 68 value = [0, 0, 67, 1] 1053->1143 1055 duration <= 87.5 gini = 0.036 samples = 54 value = [0, 0, 53, 1] 1054->1055 1060 num_voted_users <= 141419.5 gini = 0.295 samples = 207 value = [1, 31, 171, 4] 1054->1060 1056 duration <= 85.5 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1055->1056 1059 gini = 0.0 samples = 52 value = [0, 0, 52, 0] 1055->1059 1057 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1056->1057 1058 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1056->1058 1061 num_voted_users <= 139221.5 gini = 0.35 samples = 93 value = [0, 21, 72, 0] 1060->1061 1102 num_voted_users <= 191691.0 gini = 0.237 samples = 114 value = [1, 10, 99, 4] 1060->1102 1062 num_user_for_reviews <= 899.5 gini = 0.293 samples = 84 value = [0, 15, 69, 0] 1061->1062 1093 num_critic_for_reviews <= 169.5 gini = 0.444 samples = 9 value = [0, 6, 3, 0] 1061->1093 1063 duration <= 84.5 gini = 0.267 samples = 82 value = [0, 13, 69, 0] 1062->1063 1092 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1062->1092 1064 num_user_for_reviews <= 322.0 gini = 0.494 samples = 9 value = [0, 4, 5, 0] 1063->1064 1069 num_user_for_reviews <= 257.5 gini = 0.216 samples = 73 value = [0, 9, 64, 0] 1063->1069 1065 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 1064->1065 1066 duration <= 59.0 gini = 0.32 samples = 5 value = [0, 4, 1, 0] 1064->1066 1067 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1066->1067 1068 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 1066->1068 1070 num_voted_users <= 119244.5 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 1069->1070 1073 num_voted_users <= 131111.5 gini = 0.182 samples = 69 value = [0, 7, 62, 0] 1069->1073 1071 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1070->1071 1072 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1070->1072 1074 num_voted_users <= 110615.0 gini = 0.075 samples = 51 value = [0, 2, 49, 0] 1073->1074 1079 num_voted_users <= 132958.5 gini = 0.401 samples = 18 value = [0, 5, 13, 0] 1073->1079 1075 num_voted_users <= 110438.0 gini = 0.375 samples = 8 value = [0, 2, 6, 0] 1074->1075 1078 gini = 0.0 samples = 43 value = [0, 0, 43, 0] 1074->1078 1076 gini = 0.0 samples = 6 value = [0, 0, 6, 0] 1075->1076 1077 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1075->1077 1080 movie_facebook_likes <= 6000.0 gini = 0.48 samples = 5 value = [0, 3, 2, 0] 1079->1080 1085 num_critic_for_reviews <= 420.0 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 1079->1085 1081 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1080->1081 1082 num_critic_for_reviews <= 242.0 gini = 0.444 samples = 3 value = [0, 1, 2, 0] 1080->1082 1083 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1082->1083 1084 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1082->1084 1086 movie_facebook_likes <= 37500.0 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 1085->1086 1091 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1085->1091 1087 gini = 0.0 samples = 10 value = [0, 0, 10, 0] 1086->1087 1088 duration <= 107.5 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 1086->1088 1089 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1088->1089 1090 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1088->1090 1094 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1093->1094 1095 num_critic_for_reviews <= 423.0 gini = 0.375 samples = 8 value = [0, 6, 2, 0] 1093->1095 1096 duration <= 118.5 gini = 0.245 samples = 7 value = [0, 6, 1, 0] 1095->1096 1101 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1095->1101 1097 gini = 0.0 samples = 5 value = [0, 5, 0, 0] 1096->1097 1098 movie_facebook_likes <= 22000.0 gini = 0.5 samples = 2 value = [0, 1, 1, 0] 1096->1098 1099 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1098->1099 1100 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1098->1100 1103 num_user_for_reviews <= 268.5 gini = 0.225 samples = 113 value = [1, 9, 99, 4] 1102->1103 1142 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1102->1142 1104 num_critic_for_reviews <= 286.5 gini = 0.531 samples = 8 value = [0, 1, 5, 2] 1103->1104 1109 num_voted_users <= 173129.0 gini = 0.192 samples = 105 value = [1, 8, 94, 2] 1103->1109 1105 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 1104->1105 1106 duration <= 107.0 gini = 0.444 samples = 3 value = [0, 1, 0, 2] 1104->1106 1107 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1106->1107 1108 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 1106->1108 1110 num_voted_users <= 146740.0 gini = 0.086 samples = 67 value = [0, 3, 64, 0] 1109->1110 1121 num_voted_users <= 174584.5 gini = 0.356 samples = 38 value = [1, 5, 30, 2] 1109->1121 1111 num_voted_users <= 146163.0 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 1110->1111 1116 duration <= 112.0 gini = 0.036 samples = 54 value = [0, 1, 53, 0] 1110->1116 1112 duration <= 97.5 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 1111->1112 1115 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1111->1115 1113 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1112->1113 1114 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 1112->1114 1117 gini = 0.0 samples = 37 value = [0, 0, 37, 0] 1116->1117 1118 duration <= 113.5 gini = 0.111 samples = 17 value = [0, 1, 16, 0] 1116->1118 1119 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1118->1119 1120 gini = 0.0 samples = 16 value = [0, 0, 16, 0] 1118->1120 1122 num_voted_users <= 174064.5 gini = 0.5 samples = 2 value = [0, 1, 0, 1] 1121->1122 1125 num_user_for_reviews <= 998.0 gini = 0.292 samples = 36 value = [1, 4, 30, 1] 1121->1125 1123 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1122->1123 1124 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1122->1124 1126 duration <= 101.5 gini = 0.251 samples = 35 value = [0, 4, 30, 1] 1125->1126 1141 gini = 0.0 samples = 1 value = [1, 0, 0, 0] 1125->1141 1127 gini = 0.0 samples = 12 value = [0, 0, 12, 0] 1126->1127 1128 duration <= 105.5 gini = 0.355 samples = 23 value = [0, 4, 18, 1] 1126->1128 1129 movie_facebook_likes <= 6000.0 gini = 0.611 samples = 6 value = [0, 2, 3, 1] 1128->1129 1134 duration <= 120.0 gini = 0.208 samples = 17 value = [0, 2, 15, 0] 1128->1134 1130 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1129->1130 1131 num_critic_for_reviews <= 150.0 gini = 0.444 samples = 3 value = [0, 2, 0, 1] 1129->1131 1132 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1131->1132 1133 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1131->1133 1135 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 1134->1135 1136 num_user_for_reviews <= 526.5 gini = 0.444 samples = 6 value = [0, 2, 4, 0] 1134->1136 1137 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1136->1137 1138 num_user_for_reviews <= 716.5 gini = 0.5 samples = 4 value = [0, 2, 2, 0] 1136->1138 1139 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1138->1139 1140 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1138->1140 1144 num_critic_for_reviews <= 180.5 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1143->1144 1147 gini = 0.0 samples = 66 value = [0, 0, 66, 0] 1143->1147 1145 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1144->1145 1146 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1144->1146 1149 gini = 0.0 samples = 4 value = [0, 4, 0, 0] 1148->1149 1150 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1148->1150 1152 num_critic_for_reviews <= 132.0 gini = 0.123 samples = 216 value = [0, 5, 202, 9] 1151->1152 1187 num_critic_for_reviews <= 212.5 gini = 0.48 samples = 5 value = [0, 0, 2, 3] 1151->1187 1153 num_user_for_reviews <= 493.0 gini = 0.32 samples = 20 value = [0, 0, 16, 4] 1152->1153 1156 num_voted_users <= 193197.5 gini = 0.098 samples = 196 value = [0, 5, 186, 5] 1152->1156 1154 gini = 0.0 samples = 16 value = [0, 0, 16, 0] 1153->1154 1155 gini = 0.0 samples = 4 value = [0, 0, 0, 4] 1153->1155 1157 movie_facebook_likes <= 6000.0 gini = 0.444 samples = 3 value = [0, 0, 2, 1] 1156->1157 1160 movie_facebook_likes <= 12500.0 gini = 0.09 samples = 193 value = [0, 5, 184, 4] 1156->1160 1158 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1157->1158 1159 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1157->1159 1161 gini = 0.0 samples = 89 value = [0, 0, 89, 0] 1160->1161 1162 movie_facebook_likes <= 13500.0 gini = 0.162 samples = 104 value = [0, 5, 95, 4] 1160->1162 1163 num_voted_users <= 217424.5 gini = 0.625 samples = 4 value = [0, 2, 1, 1] 1162->1163 1168 num_voted_users <= 355995.5 gini = 0.115 samples = 100 value = [0, 3, 94, 3] 1162->1168 1164 duration <= 121.5 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1163->1164 1167 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1163->1167 1165 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1164->1165 1166 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1164->1166 1169 num_voted_users <= 229755.0 gini = 0.085 samples = 91 value = [0, 3, 87, 1] 1168->1169 1184 num_voted_users <= 374795.0 gini = 0.346 samples = 9 value = [0, 0, 7, 2] 1168->1184 1170 num_voted_users <= 229120.5 gini = 0.17 samples = 32 value = [0, 3, 29, 0] 1169->1170 1179 num_critic_for_reviews <= 186.0 gini = 0.033 samples = 59 value = [0, 0, 58, 1] 1169->1179 1171 num_critic_for_reviews <= 374.0 gini = 0.121 samples = 31 value = [0, 2, 29, 0] 1170->1171 1178 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1170->1178 1172 gini = 0.0 samples = 18 value = [0, 0, 18, 0] 1171->1172 1173 movie_facebook_likes <= 28000.0 gini = 0.26 samples = 13 value = [0, 2, 11, 0] 1171->1173 1174 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1173->1174 1175 num_critic_for_reviews <= 382.0 gini = 0.153 samples = 12 value = [0, 1, 11, 0] 1173->1175 1176 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1175->1176 1177 gini = 0.0 samples = 11 value = [0, 0, 11, 0] 1175->1177 1180 num_critic_for_reviews <= 171.0 gini = 0.375 samples = 4 value = [0, 0, 3, 1] 1179->1180 1183 gini = 0.0 samples = 55 value = [0, 0, 55, 0] 1179->1183 1181 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1180->1181 1182 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1180->1182 1185 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 1184->1185 1186 gini = 0.0 samples = 7 value = [0, 0, 7, 0] 1184->1186 1188 gini = 0.0 samples = 3 value = [0, 0, 0, 3] 1187->1188 1189 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1187->1189 1191 num_voted_users <= 230844.5 gini = 0.497 samples = 13 value = [0, 7, 6, 0] 1190->1191 1200 movie_facebook_likes <= 18500.0 gini = 0.128 samples = 29 value = [0, 2, 27, 0] 1190->1200 1192 num_voted_users <= 126007.5 gini = 0.375 samples = 8 value = [0, 6, 2, 0] 1191->1192 1197 duration <= 147.5 gini = 0.32 samples = 5 value = [0, 1, 4, 0] 1191->1197 1193 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1192->1193 1194 num_critic_for_reviews <= 175.5 gini = 0.245 samples = 7 value = [0, 6, 1, 0] 1192->1194 1195 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1194->1195 1196 gini = 0.0 samples = 6 value = [0, 6, 0, 0] 1194->1196 1198 gini = 0.0 samples = 4 value = [0, 0, 4, 0] 1197->1198 1199 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1197->1199 1201 gini = 0.0 samples = 27 value = [0, 0, 27, 0] 1200->1201 1202 gini = 0.0 samples = 2 value = [0, 2, 0, 0] 1200->1202 1204 num_voted_users <= 140608.0 gini = 0.408 samples = 7 value = [0, 0, 2, 5] 1203->1204 1209 num_user_for_reviews <= 361.0 gini = 0.204 samples = 18 value = [0, 1, 16, 1] 1203->1209 1205 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1204->1205 1206 duration <= 106.0 gini = 0.278 samples = 6 value = [0, 0, 1, 5] 1204->1206 1207 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1206->1207 1208 gini = 0.0 samples = 5 value = [0, 0, 0, 5] 1206->1208 1210 gini = 0.0 samples = 1 value = [0, 1, 0, 0] 1209->1210 1211 num_voted_users <= 397355.5 gini = 0.111 samples = 17 value = [0, 0, 16, 1] 1209->1211 1212 gini = 0.0 samples = 16 value = [0, 0, 16, 0] 1211->1212 1213 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1211->1213 1215 gini = 0.0 samples = 7 value = [0, 0, 0, 7] 1214->1215 1216 movie_facebook_likes <= 151500.0 gini = 0.32 samples = 50 value = [0, 0, 40, 10] 1214->1216 1217 num_critic_for_reviews <= 303.0 gini = 0.278 samples = 48 value = [0, 0, 40, 8] 1216->1217 1240 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 1216->1240 1218 movie_facebook_likes <= 7000.0 gini = 0.5 samples = 8 value = [0, 0, 4, 4] 1217->1218 1223 movie_facebook_likes <= 135500.0 gini = 0.18 samples = 40 value = [0, 0, 36, 4] 1217->1223 1219 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1218->1219 1220 num_user_for_reviews <= 1369.5 gini = 0.32 samples = 5 value = [0, 0, 1, 4] 1218->1220 1221 gini = 0.0 samples = 4 value = [0, 0, 0, 4] 1220->1221 1222 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1220->1222 1224 duration <= 116.0 gini = 0.145 samples = 38 value = [0, 0, 35, 3] 1223->1224 1237 movie_facebook_likes <= 149500.0 gini = 0.5 samples = 2 value = [0, 0, 1, 1] 1223->1237 1225 duration <= 114.0 gini = 0.278 samples = 12 value = [0, 0, 10, 2] 1224->1225 1232 num_voted_users <= 439630.0 gini = 0.074 samples = 26 value = [0, 0, 25, 1] 1224->1232 1226 num_user_for_reviews <= 1028.5 gini = 0.165 samples = 11 value = [0, 0, 10, 1] 1225->1226 1231 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1225->1231 1227 gini = 0.0 samples = 8 value = [0, 0, 8, 0] 1226->1227 1228 num_user_for_reviews <= 1173.5 gini = 0.444 samples = 3 value = [0, 0, 2, 1] 1226->1228 1229 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1228->1229 1230 gini = 0.0 samples = 2 value = [0, 0, 2, 0] 1228->1230 1233 movie_facebook_likes <= 79000.0 gini = 0.375 samples = 4 value = [0, 0, 3, 1] 1232->1233 1236 gini = 0.0 samples = 22 value = [0, 0, 22, 0] 1232->1236 1234 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1233->1234 1235 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1233->1235 1238 gini = 0.0 samples = 1 value = [0, 0, 0, 1] 1237->1238 1239 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1237->1239 1242 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1241->1242 1243 gini = 0.0 samples = 5 value = [0, 0, 0, 5] 1241->1243 1245 gini = 0.0 samples = 25 value = [0, 0, 0, 25] 1244->1245 1246 num_voted_users <= 710222.0 gini = 0.424 samples = 36 value = [0, 0, 11, 25] 1244->1246 1247 movie_facebook_likes <= 16000.0 gini = 0.498 samples = 17 value = [0, 0, 9, 8] 1246->1247 1256 duration <= 175.5 gini = 0.188 samples = 19 value = [0, 0, 2, 17] 1246->1256 1248 gini = 0.0 samples = 3 value = [0, 0, 3, 0] 1247->1248 1249 num_critic_for_reviews <= 502.0 gini = 0.49 samples = 14 value = [0, 0, 6, 8] 1247->1249 1250 num_user_for_reviews <= 665.0 gini = 0.245 samples = 7 value = [0, 0, 1, 6] 1249->1250 1253 movie_facebook_likes <= 143000.0 gini = 0.408 samples = 7 value = [0, 0, 5, 2] 1249->1253 1251 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1250->1251 1252 gini = 0.0 samples = 6 value = [0, 0, 0, 6] 1250->1252 1254 gini = 0.0 samples = 5 value = [0, 0, 5, 0] 1253->1254 1255 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 1253->1255 1257 gini = 0.0 samples = 15 value = [0, 0, 0, 15] 1256->1257 1258 num_critic_for_reviews <= 321.5 gini = 0.5 samples = 4 value = [0, 0, 2, 2] 1256->1258 1259 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1258->1259 1260 num_critic_for_reviews <= 664.5 gini = 0.444 samples = 3 value = [0, 0, 1, 2] 1258->1260 1261 gini = 0.0 samples = 2 value = [0, 0, 0, 2] 1260->1261 1262 gini = 0.0 samples = 1 value = [0, 0, 1, 0] 1260->1262

b) KNN

In [116]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Out[116]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
In [117]:
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
#print(metrics.roc_auc_score(y_test, knn.predict(X_test)))
0.631154879140555
--------------------------------------------------------
[[  1  12  17   0]
 [  3  83 214   0]
 [  4 131 596   3]
 [  0   0  28  25]]
--------------------------------------------------------
              precision    recall  f1-score   support

           1       0.12      0.03      0.05        30
           2       0.37      0.28      0.32       300
           3       0.70      0.81      0.75       734
           4       0.89      0.47      0.62        53

   micro avg       0.63      0.63      0.63      1117
   macro avg       0.52      0.40      0.43      1117
weighted avg       0.60      0.63      0.61      1117

--------------------------------------------------------
In [118]:
k_range = range(1, 10)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores.append(np.mean(cross_val_score(knn, X, y, cv=10, scoring = 'accuracy')))
    
    
    
#plt.figure()    
plt.plot(k_range, scores)
plt.xlabel('k value')
plt.ylabel('accuracy')
Out[118]:
Text(0, 0.5, 'accuracy')

this graph shows the best number for k is 3 here.

c) GradientBoostingClassifier

In [119]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
In [120]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)
gb.fit(X_train, y_train)
Out[120]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=0,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
In [121]:
print(metrics.accuracy_score(y_test, gb.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, gb.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, gb.predict(X_test)))
print("--------------------------------------------------------")
0.6956132497761862
--------------------------------------------------------
[[  0  19  11   0]
 [  2 128 170   0]
 [  1 106 622   5]
 [  0   1  25  27]]
--------------------------------------------------------
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        30
           2       0.50      0.43      0.46       300
           3       0.75      0.85      0.80       734
           4       0.84      0.51      0.64        53

   micro avg       0.70      0.70      0.70      1117
   macro avg       0.52      0.45      0.47      1117
weighted avg       0.67      0.70      0.68      1117

--------------------------------------------------------
In [122]:
# 10-fold cross-validation

scores = cross_val_score(gb, X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
[0.66133333 0.73726542 0.73726542 0.73726542 0.76675603 0.71505376
 0.72237197 0.69272237 0.63072776 0.57681941]
0.697758088502853

d) Support Vector Machine

In [123]:
svm = SVC(gamma='scale', probability=True)
svm.fit(X_train, y_train)
Out[123]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [124]:
print(metrics.accuracy_score(y_test, svm.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, svm.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, svm.predict(X_test)))
print("--------------------------------------------------------")
0.6768128916741272
--------------------------------------------------------
[[  0   0  30   0]
 [  0   0 300   0]
 [  0   0 733   1]
 [  0   0  30  23]]
--------------------------------------------------------
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        30
           2       0.00      0.00      0.00       300
           3       0.67      1.00      0.80       734
           4       0.96      0.43      0.60        53

   micro avg       0.68      0.68      0.68      1117
   macro avg       0.41      0.36      0.35      1117
weighted avg       0.49      0.68      0.56      1117

--------------------------------------------------------
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

In [125]:
# 10-fold cross-validation
svm = SVC(gamma='auto')

scores = cross_val_score(svm, X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
[0.65066667 0.6541555  0.6541555  0.6541555  0.65683646 0.65591398
 0.65768194 0.65498652 0.65498652 0.65768194]
0.6551220521446673

e) Neural Network

In [126]:
nn = MLPClassifier(solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
Out[126]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
In [127]:
print(metrics.accuracy_score(y_test, nn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, nn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, nn.predict(X_test)))
print("--------------------------------------------------------")
0.047448522829006266
--------------------------------------------------------
[[  0   0   0  30]
 [  0   0   0 300]
 [  1   0   0 733]
 [  0   0   0  53]]
--------------------------------------------------------
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        30
           2       0.00      0.00      0.00       300
           3       0.00      0.00      0.00       734
           4       0.05      1.00      0.09        53

   micro avg       0.05      0.05      0.05      1117
   macro avg       0.01      0.25      0.02      1117
weighted avg       0.00      0.05      0.00      1117

--------------------------------------------------------
C:\Users\mnajjartabar\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

In this section I ran several classification models. Gradient boosting had the best performance with accruacy close to 70 percent and then decision tree had 62 percent accuracy. SVM gave 67 percent but actually it didn't work well based on confusion matrix. Both SVM and neural network had lazy performance which means they classified all test data set in one category (SVM classified as good and neural network classified as bad).

Clustering

In [128]:
dfff = df.dropna()
In [129]:
dfff = dfff[['gross', 'duration', 'aspect_ratio', 'num_critic_for_reviews', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_1_facebook_likes', 'cast_total_facebook_likes', 'num_user_for_reviews', 'budget', 'actor_2_facebook_likes', 'imdb_score']];
In [130]:
dffff = dfff[['gross', 'budget','num_user_for_reviews','imdb_score']];
In [131]:
k_means = KMeans(init='k-means++', n_clusters=4, random_state=0)
In [132]:
k_means.fit(dfff)
Out[132]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
In [133]:
from yellowbrick.cluster import SilhouetteVisualizer
from yellowbrick.datasets import load_nfl
visualizer = SilhouetteVisualizer(k_means, colors='yellowbrick')

visualizer.fit(dfff)        # Fit the data to the visualizer
visualizer.show()        
Out[133]:
<matplotlib.axes._subplots.AxesSubplot at 0x26ab04632b0>
In [134]:
from scipy.spatial.distance import cdist 

K = range(1, 10) 

meandistortions = []

for k in K: 
    kmeans = KMeans(n_clusters=k, random_state=1) 
    kmeans.fit(dfff) 
    meandistortions.append(sum(np.min(cdist(dfff, kmeans.cluster_centers_, 'euclidean'), axis=1)) / dfff.shape[0]) 

plt.plot(K, meandistortions, 'bx-') 
plt.xlabel('k') 
plt.ylabel('Average distortion') 
plt.title('Selecting k with the Elbow Method') 
Out[134]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
In [135]:
range_n_clusters = list (range(2,10))
print ("Number of clusters from 2 to 9: \n", range_n_clusters)
Number of clusters from 2 to 9: 
 [2, 3, 4, 5, 6, 7, 8, 9]

So from the graph we can see K=4 is the best and actually I had a good guess in first place. However we can consider K=5 as well but it won't improve much!

Story telling

First of all to have a reliable analysis we need a reliable data set and to make sure we have a reliable data set we need to prepare our data. By inspecting data we noticed that we have some missing (as NaN or zero values) and duplicated values. after cleaning our data set shrinked to about 3700 movies which is still a big data set to do analysis. Also almost all of movies are color and just 123 of them are black and white. we can remove color column here but personally I don't like miss any information so I'll keep it.

Here is some analysis:

  • This data set contain movies from several contries. USA, France, Germany, Canada, Australia, Spain, Japan, Hong Kong, China, Italy, New Zealand, South Korea, Denmark, Ireland, Mexico, Brazil, India, Iran and Norway are first 20 countries in this list.
  • Interesting point about countries is India with that big Bollywood has just 5 movies in this list and countries Iran has 4 movies. Just a general information I know Bollywood makes more movies than Hollywood in a year!
  • My client were more interested about movies from Iran so I digged into it and I found the names. There was 4 movies from Iran but 3 movuies in persian and that was because one old movie made in Iran and Afghanestan and mostly in English.
  • Checking profit and IMDB score showed that a movie (Eddie the Eagle) had negative profit with IMDB score of 15 which was really high!!!
  • Avatar had the best profit but The Dark Night had the best IMDB score!
  • A huge number of movies has imdb better than 8 which I named as incredible
  • I didn't see any strong relations between actors facebook likes and imdb score
  • Correlation analysis showed that num_critic_for_reviews, duration, num_voted_users, num_user_for_reviews, and movie_facebook_likes are highly positive correlated to imdb score and title year is negativly and relativly correlated which means people probably won't like the most recent movies as much as they liked old movies but probably fact is each movies takes times to get more viewers and increase imdb score and older movies had enough time for that
  • that num_critic_for_reviews, duration, num_voted_users, num_user_for_reviews, and movie_facebook_likes are highly positive correlated to imdb score and title year is negativly and relativly correlated which means people probably won't like the most recent movies as much as they liked old movies but probably fact is each movies takes times to get more viewers and increase imdb score and older movies had enough time for that.
  • all the above regression models has same r-squared valed (.304) but the full model has the lowest mean squared value which means full model is better. Also all p-values are small so all the selected fatcors are significant.
  • Gradient boosting had the best performance among 5 different classifier that I used with accruacy close to 70 percent and then decision tree had 62 percent accuracy. SVM gave 67 percent but actually it didn't work well based on confusion matrix. Both SVM and neural network had lazy performance which means they classified all test data set in one category (SVM classified as good and neural network classified as bad).
  • Finally we can say to use Kmeans clustering algourithm, the number of clusters is 4.

More analysis has been provided in the report.

In [136]:
    Image(url= "https://s.hdnux.com/photos/01/03/74/10/17809955/3/480x480.png")
Out[136]: